1. Objectives

We have now been active for several years in the area of distributed systems. It has become
apparent to us that this subarea of parallel or concurrent programming systems is
tractable: a precise theory for distributed, message passing computations may be developed;
important paradigms can be abstracted and applied in a number of practical situations; and
reasoning techniques can be developed for distributed programs which can also be effectively
employed in their development.

We have been active in all these areas. Our goal is to make distributed programming
conceptually as simple as sequential programming. The added burden of distribution could be
handled if adequate general theories were available. We have had experience in developing
specific algorithms (deadlock detection, knot detection, shortest path, etc.), theories specific to a
class of problems, and reasoning techniques applicable to a class of properties of distributed
programs.

Our  thrust  of  research  in  the  past  year  has  been  to  move  from  specific  to  general,  by 
abstracting  the  relevant  concepts  from  specific  problems  and  applying  them  to  a  general  class. 
Our  outstanding  contributions  in  the  past  year  are: 

• development of a general theory for studying computability issues in distributed systems,

• a paradigm for development of very efficient stability detection algorithms,

• extension of our proof theory to encompass a wider variety of properties that can be proven.

We have continued our very successful work on modeling and distributed simulation.

Our work has attracted considerable international attention. Professor K. M. Chandy was
invited to deliver the keynote address at the ACM Principles of Distributed Computing
Conference, the premier conference for this area, in 1981; he was also invited to give talks at
Stanford, Cornell, University of California at Berkeley, University of Minnesota,
Pennsylvania State University, IBM Research at Yorktown Heights, and Computer Society of
India. Professor J. Misra was made a member of the prestigious International Federation for
Information Processing Working Group (IFIP WG 2.3) and a member of the Editorial Board of the
Journal of the ACM; he delivered invited talks at the University of California at Berkeley,
University of California at Los Angeles, Cal Tech, University of Washington, IBM Research at
San Jose, Xerox Palo Alto Research Center and at several workshops.

Annual Report, AFOSR 81-0205B

December, 1981

We  sketch  the  technical  aspects  of  our  recent  work  in  the  following  pages. 

2. Work in 83-84

2.1.  A  Computability  Theory  for  Asynchronous  Distributed  Systems 

Sequential systems had a well-developed theory, founded in logic and developed by Church,
Turing, Gödel and others [5], before any sequential program was ever executed on a
computer. The existence of important theoretical results, such as the unsolvability of the halting
problem, guided programming practitioners. The situation is entirely different in distributed
systems; there is no well-founded theory of computability in distributed systems. Hence
considerable effort has been expended on solving problems which have later been shown to be
unsolvable; designing a distributed database commit protocol which is robust to process failures
is only one instance [3].

In recent years a number of results of the following form have appeared in the literature:
there is no asynchronous distributed algorithm for solving problem P, where P is a specific
problem. For instance, it is now known that: it is impossible to elect a unique leader process
from among a set of identical processes [1]; there is no symmetric algorithm for solving the dining
philosophers problem [6]; it is impossible to implement a commit protocol for distributed
databases in the presence of even one faulty process [3]; there is no solution to the Byzantine
agreement problem in a fully asynchronous system [2]; and no protocol exists for achieving
common knowledge in an asynchronous system [4]. In each case, ad hoc techniques have been
used in proving these results. The absence of a common computability theory for distributed
systems has hindered progress in this area. This is only natural, since the requisite concepts
have not been developed for the foundation of such a theory. We contrast the situation with the
well-developed computability theory in sequential systems. The powerful notion of reducibility has
been applied in sequential systems in showing several problems unsolvable: if problem A is
known to be unsolvable and problem A is reducible to problem B (i.e. if problem B can be solved
then problem A can be solved), then problem B is unsolvable. Such an approach is attractive in
that an entirely new proof of the unsolvability of B is avoided. We have no notion of
problem reducibility in distributed systems and therefore each unsolvability proof is entirely new.
Secondly, there is no common, problem independent basis for showing that certain classes of
problems are unsolvable; there is no commonly accepted halting problem for distributed
systems.

We have recently developed a theory to help solve some of these problems. The theory is
based on a precise definition of distributed computations and a number of operators on these
computations. The operators are projection (study the computation of one or more processes
from the entire system computation), prefix (one computation precedes another), union and
intersection (union being defined only for computations which are prefixes of a common
computation), and gluing (merging of two computations). Gluing allows us to conclude
properties of distributed systems in the following manner: if C1, C2 are system computations
then C = glue(C1, C2) is also a system computation, where the glue operation is suitably defined.
If C is infeasible, e.g. C allows more than one process to be in its critical section
simultaneously, or more than one leader to be elected, or a commit protocol to
commit to two different values, then we may conclude that either C1 or C2 is not a system
computation. We have applied this technique to prove a number of properties of any algorithm
for mutual exclusion, electing leaders, etc. We have also constructed very simple proofs of the
impossibility of process failure detection, robust commit protocols, Byzantine agreement in
asynchronous systems, etc.

We have applied this theory to the study of knowledge and common knowledge [4]. We
have derived very simple conditions for the number of messages required to establish or
disestablish certain knowledge levels. For instance, if A knows that B knows that C knows fact
f, where f is some fact local to C, then at least 2 message transmissions will be required in the
system before C can falsify f. The bounds we have derived are tight; they can be used to prove
the impossibility of achieving certain knowledge levels, such as common knowledge, because such
computations can be shown to require an infinite number of messages.

There  have  been  a  large  number  of  intuitive  ideas  and  ad-hoc  results  in  asynchronous 
distributed  systems.  The  goal  of  any  unifying  theory  is  to  abstract  a  certain  kernel  and  provide 
rules  for  derivation  of  the  different  results.  We  believe  that  we  are  working  towards  an  elegant 
theory  with  a  small  kernel  and  a  small  set  of  rules. 

2.2.  A  Paradigm  for  Developing  Efficient  Algorithms  for  Stable  Property  Detection 

It is often required to detect whether a system state has achieved stability, i.e. it is not
going to change. Examples of such properties are termination, number of tokens equals zero
(assuming no process creates tokens), deadlock in a subset of processes, etc. In fact many
important distributed algorithms can best be described in a form in which termination is implicit.
Last year, we developed an algorithm in "Distributed Snapshots: Determining Global States of
Distributed Systems" by Chandy and Lamport, which allowed a process to take a snapshot or
checkpoint of a distributed system during the evolution of its computation. This effectively
solved all stability detection problems. However, we recently discovered that stability detection
does not require taking snapshots. There is a general paradigm for stability detection in which
processes are merely observed over overlapping intervals of execution and each process reports
presence or absence of activity over its interval. We show that if there is a single point common
to all intervals, then the system is stable if every process is inactive over the entire length of its
interval.
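
The interval condition above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the algorithm referred to in this report: the `Observation` record, the numeric interval endpoints (as an external observer might label them) and the function names are all hypothetical, and "activity" is assumed to cover any state change, send or receive by the process.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    """One process's report (hypothetical record, for illustration only)."""
    start: int    # beginning of the interval over which the process was observed
    end: int      # end of that interval
    active: bool  # True if the process showed any activity within the interval

def common_point(observations):
    """Return a point lying in every interval, or None if no such point exists."""
    latest_start = max(o.start for o in observations)
    earliest_end = min(o.end for o in observations)
    return latest_start if latest_start <= earliest_end else None

def can_conclude_stable(observations):
    """True when the stated condition holds: the intervals share a point
    and every process reported inactivity over its whole interval."""
    no_activity = not any(o.active for o in observations)
    return common_point(observations) is not None and no_activity
```

A result of False does not mean the system is unstable; it only means that this round of observations is inconclusive and another round is needed.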

This  paradigm  has  led  to  the  design  of  a  new  class  of  algorithms  which  are  simpler  to 
program  and  prove  and  which  seem  very  efficient  in  their  execution. 

2.3.  A  Proof  Theory  Based  on  the  Notion  of  Quiescence 

Two general classes of properties in sequential or distributed systems are safety and
liveness. In sequential programs safety properties are proven by postulating invariants and
liveness properties (one of which is program termination) are proven by showing that a certain
metric decreases in each step of program execution. The problem is much harder in distributed
programs. Previously, we had developed a proof theory for verification of safety properties only.
The difficulty with liveness is that termination is not a natural property for processes in a
distributed system; normally, a distributed program, such as an operating system, never
terminates. During the last year, we made major progress in attacking the liveness
problem. We identified a new property, quiescence, for a process in a distributed system which
seems to be the natural generalization of termination in a sequential process. Roughly, a process
is quiescent if it will send no messages provided it receives no messages. This is the most that can
be derived from a process because a process, by itself, cannot guarantee that it will receive no
inputs. A network is quiescent if all component processes are quiescent. We introduced a novel
technical idea which eliminated channels from quiescence consideration.

We  have  now  developed  the  proof  theory  where  we  can  (1)  prove  safety  and  liveness  in  a 
unified  framework,  (2)  support  hierarchical  network  structure,  (3)  develop  modular  process  proofs 
and  (4)  construct  proofs  which  directly  map  informal  proofs  into  a  formal  proof  in  our  logic.  We 
are  now  experimenting  with  the  applicability  of  these  ideas  in  various  difficult  distributed 
algorithms. 

2.4.  Other  Work 

We have continued our work on distributed simulation. These ideas, first published in 1979
and continuously refined since then, have attracted wide international recognition.
Professor Misra presented a 1 day tutorial on Distributed Computing at the IEEE Fourth
International Conference on Distributed Computing Systems in San Francisco, California on May

Update to Last Year's Annual Report (see attached - September 1983)


Item 3: Preserving Asymmetry by Symmetric Processes and
Distributed Fair Conflict Resolution

Status: New Title: Drinking Philosophers Problem
ACM Transactions on Programming Languages & Systems
October 1984


Item 4: A Distributed Procedure to Detect AND/OR Deadlock

Status: Withdrawn from consideration


Item 5: Detecting Stability in Distributed Systems

Status: New Title: Distributed Snapshots: Determining
Global States of Distributed Systems
ACM Transactions on Computer Systems
(to appear)

1. Goal

The goal of this paper is to discuss the area of distributed computing in
an informal manner. I shall not present theory, algorithms or
experimental results. Instead, I shall restrict attention to the following
questions:

•  What  is  distributed  computing? 

•  Why  should  one  study  distributed  computing? 

•  What  are  the  fundamental  questions  in  distributed  computing? 

The answers to these questions will be philosophical, rambling and
subjective; but I think the answers have some merit.

2.  What  is  Distributed  Computing? 

2.1.  A  Distributed  System 

We have to present our discussion in terms of a model of a system. The
model chosen is not important in itself. We could have couched our
discussion in terms of other models. We shall describe our model
informally and only to the level of detail necessary to make our algorithms
clear.

A distributed system D is defined by its set P of component processes and
set C of directed channels, i.e. D = (P, C). Let there be N > 0 processes
in P and let them be indexed p_i, 1 <= i <= N. A channel c in C is directed
from a (single) component process p_i to a (single) component process p_j,
and the channel is defined by c = (p_i, p_j). Each channel has an infinite
buffer. (Bounds on buffer sizes are discussed later.) A process p_i can send
a message along one of its outgoing channels (p_i, p_j) whenever it wishes to.
Channels are loss-free, error-free and deliver messages in the order sent.
The state of a channel (p_i, p_j) is a queue of messages; the queue represents
the messages sent by p_i and not received by p_j.

A process p_i in P is specified by a set of process states, an initial process
state and a set of allowable events. An event is (1) an autonomous state
transition, (2) a send or (3) a receive. An autonomous state transition at p_i
takes p_i from a process state s to a process state s'; the autonomous state
transition event is defined by the pair (s, s'), and this event can occur at p_i
only if p_i is in process state s immediately before the event. An
autonomous state transition at p_i does not change the state of any channel
or the state of any process besides p_i's.



A send event at p_i is the sending of a message by p_i coupled with a
transition of p_i's state. It is defined by the states s and s' before and
after the event, respectively, the message M that is sent and the process p_j
that it is sent to. This event can occur at p_i only if p_i is in state s
immediately before the event. This event causes M to be inserted into the
queue representing the state of channel (p_i, p_j). The states of channels
other than (p_i, p_j) and processes other than p_i are not changed by the
occurrence of this event.

A receive event at p_i is the receipt of a message by p_i coupled with a
state transition. It is defined by the states s and s' before and after the
event, respectively, the message M that is received and the process p_j that
the message is received from. This event can occur at p_i only if p_i is in
state s immediately before the event and M is at the head of the queue of
messages representing the state of channel (p_j, p_i); this event causes the
deletion of the message at the head of this queue. The states of channels
other than (p_j, p_i) and processes other than p_i are not changed by the
occurrence of this event.

An event may occur at a process at any time provided the states of
processes and channels are such that the event can occur. The process and
channel states may be such that one of many events may occur; the
selection among the potential events is non-deterministic.
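
As an illustration of the model just described, here is a minimal Python sketch. The class name `System`, its method names and the use of strings for process names and states are assumptions made for this example; they do not come from the paper. Each method raises an error when the event is not enabled in the current state, which mirrors the "can occur only if" conditions above.

```python
from collections import deque

class System:
    """Hypothetical encoding of D = (P, C): process states plus FIFO channel queues."""

    def __init__(self, initial_states, channels):
        self.state = dict(initial_states)            # process name -> process state
        self.queue = {c: deque() for c in channels}  # (sender, receiver) -> queue

    def _require(self, condition, reason):
        if not condition:
            raise ValueError("event cannot occur: " + reason)

    def transition(self, p, s, s_next):
        """Autonomous state transition (s, s') at p; no channel is touched."""
        self._require(self.state[p] == s, "process is not in state s")
        self.state[p] = s_next

    def send(self, p, s, s_next, msg, to):
        """Send event at p: append msg to the tail of channel (p, to)."""
        self._require(self.state[p] == s, "process is not in state s")
        self.queue[(p, to)].append(msg)
        self.state[p] = s_next

    def receive(self, p, s, s_next, msg, frm):
        """Receive event at p: msg must be at the head of channel (frm, p)."""
        q = self.queue[(frm, p)]
        self._require(self.state[p] == s, "process is not in state s")
        self._require(bool(q) and q[0] == msg, "msg is not at the head of the queue")
        q.popleft()
        self.state[p] = s_next
```

For example, after `send("p", "s0", "s1", "hello", to="q")` the queue for channel `("p", "q")` holds `"hello"`, and a matching `receive` at `q` removes it again.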

2.2.  A  Distributed  Computation 

A process computation z_i of a process p_i is defined as a sequence of
allowable events <e_i1, e_i2, ...> at p_i such that the state of p_i before event
e_ik, k > 1, is its state after the previous event e_i(k-1), and p_i's state before the
first event, e_i1, is its initial state. A process computation may be finite or
infinite, empty or non-empty, and is prefix-closed, i.e. if z_i =
<e_i1, ..., e_ik, ...> is a computation of p_i then every initial subsequence
<e_i1, ..., e_ik> of z_i is also a computation of p_i.

We define a system computation using Lamport's ideas of partially-ordered
events. A system computation Z is a set of component process
computations, Z = {z_i | 1 <= i <= N}, such that the channel rule and
partially-ordered events rule (described below) are satisfied.

Channel  Rule 

The k-th message received along a channel in Z is the k-th message sent
along the channel in Z, all k. Formally, let n_ij be the number of messages
received by p_j from p_i in z_j. Let m_ij be the number of messages sent by p_i
to p_j in z_i. Then m_ij >= n_ij, all i,j. Furthermore, the k-th message sent by
p_i to p_j in z_i is the k-th message received by p_j from p_i in z_j,
1 <= k <= n_ij, all i,j.
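
The channel rule can be checked mechanically. The following Python sketch assumes a hypothetical encoding that is not part of the paper: each process computation is a list of triples `(kind, peer, message)` with `kind` equal to `"send"`, `"recv"` or `"local"`, and the whole system computation is a dictionary from process name to such a list.

```python
def channel_rule_holds(computations):
    """Check that, on every channel, the messages received form a prefix
    (same messages, same order) of the messages sent."""
    for i, z_i in computations.items():
        for j, z_j in computations.items():
            # Messages sent by i to j, in sending order.
            sent = [m for (kind, peer, m) in z_i if kind == "send" and peer == j]
            # Messages received by j from i, in receiving order.
            received = [m for (kind, peer, m) in z_j if kind == "recv" and peer == i]
            # n_ij <= m_ij and the k-th receive matches the k-th send.
            if received != sent[:len(received)]:
                return False
    return True
```

A computation in which `q` receives `"b"` before `"a"` although `p` sent `"a"` first, or in which `q` receives a message that `p` never sent, fails this check.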

Partially  Ordered  Events  Rule 

The relation -> between events is a partial ordering of the events in
process computations where -> is defined as follows:

e -> e' if and only if

1. e and e' are events in the same process computation and e
occurs before e' in the computation or

2. e' is the receipt of a message and e the corresponding send of
that message or

3. there exists an event e* such that e -> e* -> e'.
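
The three clauses above can be sketched in Python as a reachability computation. The encoding is an assumption made for this example only: process computations are lists of triples `(kind, peer, message)`, an event is named by the pair `(process, index)` with indices starting at 0, and the k-th send on a channel is matched with the k-th receive on it.

```python
def happened_before(computations):
    """Return the set of pairs (e, f) such that e -> f."""
    succ = {}
    # Clause 1: consecutive events of the same process.
    for p, z in computations.items():
        for k in range(len(z)):
            succ[(p, k)] = [(p, k + 1)] if k + 1 < len(z) else []
    # Clause 2: each send precedes the corresponding receive.
    for i, z_i in computations.items():
        for j, z_j in computations.items():
            sends = [k for k, (kind, peer, _) in enumerate(z_i)
                     if kind == "send" and peer == j]
            recvs = [k for k, (kind, peer, _) in enumerate(z_j)
                     if kind == "recv" and peer == i]
            for s, r in zip(sends, recvs):
                succ[(i, s)].append((j, r))
    # Clause 3: transitive closure by depth-first search from every event.
    relation = set()
    for e in succ:
        stack, seen = list(succ[e]), set()
        while stack:
            f = stack.pop()
            if f not in seen:
                seen.add(f)
                relation.add((e, f))
                stack.extend(succ[f])
    return relation
```

Two events e and f with neither `(e, f)` nor `(f, e)` in the result are unordered by ->, which is what makes the ordering partial rather than total.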

Graphical  Representations  of  a  Computation 

A  set  of  events  in  a  system  computation  may  be  represented  by  a 
directed  graph  whose  vertices  represent  events.  There  is  an  edge  from  (the 
vertex  representing)  an  event  e  to  an  event  e’  if  and  only  if  either  (1)  e  and 
e’  are  events  in  the  same  process  computation  and  e  immediately  precedes 
e’  in  that  computation  or  (2)  e  represents  the  sending  of  a  message  and  e’ 
the  receiving  of  it.  Figure  1  shows  an  event  graph  for  a  system  with  2 
processes.  The  vertical  lines  represent  process  computation  and  the 
diagonal  lines  represent  messages. 


Figure 2-1: A Graph of a System Computation

Let z_i = <e_i1, e_i2, ...> be a computation of p_i and let |z_i| be the length of
z_i. A point in process computation z_i is an integer k, 0 <= k <= |z_i|. The
process computation z_i(k) up to point k in z_i is defined as the empty
sequence if k = 0 and as the sequence <e_i1, ..., e_ik> otherwise. A point K
in system computation Z is a set K = {k_i | k_i is a point in z_i, 1 <= i <= N},
such that {z_i(k_i) | 1 <= i <= N} is a system computation. For example, in Figure
1, {k_1 = 0, k_2 = 2} is not a point in the system computation because
{<>, <e_21, e_22>} is not a system computation since the channel rule is
violated: there is a receive event in z_2 for which there is no
corresponding send. We shall represent a system point {k_i | 1 <= i <= N} by
the vector (k_1, ..., k_N). Examples of system points in Figure 1 are
(1,2), (2,2), (2,1).

The fundamental difference between concurrent computation and
distributed computing is that a process in a distributed system can only
access information stored in its memory; processes do not share variables
or a clock. Time has no meaning in a distributed system; only causality
(i.e. the relation -> between events) has significance. The focus of much
of the research in distributed computing is the problem of limited
information: How can a network of processes cooperate in achieving a
global task when each process has only partial information about the task?
For instance, how can the shortest path between two processes in a
network be determined when each process only has information about its
immediate neighbors?

The focus of research in the area of concurrent computations appears to
be different. The fundamental problem is not limited information but
speed. Synchronous solutions to problems (shortest path, detecting cycles
...) in which multiple processors access common memory are usually quite
different from distributed solutions, because even though the problems
share the same name (for example, shortest path) the assumptions about
the underlying architecture are so different that the problems are indeed
different.

Confusion about the two goals, (1) problem-solving with limited
information and (2) maximum speed, should be dispelled before attacking a
problem. Thus, to solve the shortest path problem as quickly as possible
one would not use an architecture with one processor at each vertex of the
graph. Then why study distributed computing?

Why  Study  Distributed  Computing? 

There  are  systems  in  which  the  time  required  for  processes  to 
communicate  is  significant  compared  to  the  time  required  for  them  to 
compute  (carry  out  basic  operations). 


Examples include systems in which processing power is geographically
distributed. Practical distributed systems include factory automation,
transportation control (managing the flow of trains, car-traffic in a city,
airplanes) and communication systems control. We must bear in mind that
problems dealing with such systems are very different from the problems of
super-computing.

What  are  the  Fundamental  Problems  of  Distributed 
Computing? 

The problems that I consider to be fundamental may well be different
from those that you consider fundamental. Identifying the essential issues
is a subjective process. Nevertheless, I believe that it is worth our while to
spend a great deal of time arguing about what is central, and what
peripheral, before we begin attacking a problem.

I  believe  that  the  problem  of  distributed  computation  is  the  problem  of 
partial  information:  many  processes  cooperating  in  achieving  global  ends 
though  each  process  has  limited  information.  My  biased  view  of 
distributed  computation  leads  me  to  identify  the  following  questions  as 
being  fundamental. 

(1) How can a process determine the state of a distributed system that it
(the process) is part of? This is a natural question stemming from our
viewpoint that the problem of distributed computation is the problem of
local information. A process has access to its local process state, i.e. it has
local information: how can it get global information, i.e. the state of the
entire distributed system? Special cases of this question are practical
questions such as: “How can a process detect whether a distributed
computation has terminated? How can a process determine whether it is
deadlocked?”

(2) How can one prove properties of a distributed system? This question
is related to the question "How should a process be specified?"

A process may be specified by (a) sets of states and events, (b) a program
or (c) its input/output relation. There are advantages to each approach.

(3) How can processes cooperate in sharing resources in a fair manner?
This is also a problem of local information. If all processes had immediate
access to global data, there would be simple solutions to the problem of sharing.
“How can sharing be achieved when no process has all the relevant
information?” That is a much more difficult question.

(4) How can processes cooperate to achieve a global task when some of
the processes may be faulty? This question leads to the Byzantine
Generals problem and similar problems.

The purpose of this paper is to start a discussion, the object of which is
to pose the right questions. However, we shall include one answer to show
that the answers may be simple.

3. Determining Global States [Chandy and Lamport]

Observation 1

(k_1, ..., k_N) is a point in system computation Z if and only if for all i,j,
1 <= i,j <= N, the number of messages received by p_j from p_i in z_j(k_j) does not
exceed the number of messages sent by p_i to p_j in z_i(k_i).

Observation  2 

(k_1, ..., k_N) is a point in system computation Z if and only if for all i,j,
1 <= i,j <= N, there is no message sent by p_i to p_j after the k_i-th event in z_i
which is received by p_j at or before the k_j-th event in z_j.
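
Observation 1 translates directly into a test for system points. The Python sketch below assumes a hypothetical encoding, not taken from the paper: process computations are lists of triples `(kind, peer, message)` and the candidate point is a dictionary from process name to k_i. The two-process computation in the usage note is invented for illustration and is not a transcription of the paper's figure.

```python
def is_system_point(computations, k):
    """Observation 1: for every channel, the receives inside the prefix of the
    receiver must not outnumber the sends inside the prefix of the sender."""
    for i, z_i in computations.items():
        for j, z_j in computations.items():
            sent = sum(1 for (kind, peer, _) in z_i[:k[i]]
                       if kind == "send" and peer == j)
            received = sum(1 for (kind, peer, _) in z_j[:k[j]]
                           if kind == "recv" and peer == i)
            if received > sent:
                return False
    return True

# Hypothetical computation: p1 sends one message and then takes a local step;
# p2 takes a local step and then receives that message.
example = {
    "p1": [("send", "p2", "m"), ("local", None, None)],
    "p2": [("local", None, None), ("recv", "p1", "m")],
}
```

With this example, `{"p1": 0, "p2": 2}` is rejected because the prefix of p2 contains a receive whose send lies outside the prefix of p1, while `{"p1": 1, "p2": 2}`, `{"p1": 2, "p2": 2}` and `{"p1": 2, "p2": 1}` are accepted.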

States 

The state of a process p_i at system computation Z is its state after the
last event of z_i in Z. The state of a channel (p_i, p_j) is the sequence of
messages sent by p_i to p_j in Z for which there are no receives by p_j from p_i
in Z. The global system state at Z is the set of states of component
processes and channels.

The state of a system at point K in Z is the state at computation
{z_i(k_i) | 1 <= i <= N}.

Algorithm  to  Determine  Global  State 

The processes collectively define a point (k_1, ..., k_N) as follows. For all i, p_i
selects the k_i-th event. To ensure that the k_i selected correspond to a
system point the processes send signals to one another where signals are
special messages which have no effect on the underlying computation.
Signals will ensure that the k_i meet the condition of observation 2.

Signal Sending Rule: p_i sends a signal along each outgoing channel after
the k_i-th event at p_i and before the next (regular) message sent along the
channel, all i.

Signal Receiving Rule: k_j must be such that the k_j-th event occurs before
the first receive by p_j along a channel following a signal received along that
channel.

The sending and receiving rules together ensure that no message sent by
p_i to p_j after the k_i-th event in z_i is received by p_j at or before the k_j-th
event in z_j (observation 2).

The system state at point K is recorded as follows. Each process p_i
records its own process state after the k_i-th event on p_i and before the
(k_i+1)-th event. Each process records the state of all incoming channels:
the state of a channel (p_i, p_j) is the sequence of messages received by p_j
after the k_j-th event on it and before p_j receives a signal from p_i.
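
The recording rules can be tried out on a toy computation. The Python sketch below is a simulation under assumptions that are not in the paper: the underlying computation is a hypothetical token-passing game on a fully connected network, a single simulated scheduler picks events at random, and the first process starts the recording a quarter of the way through the run. All names (`Process`, `snapshot_run`, `SIGNAL`) are illustrative.

```python
import random
from collections import deque

SIGNAL = "signal"  # special message with no effect on the underlying computation

class Process:
    def __init__(self, name, tokens):
        self.name, self.tokens = name, tokens
        self.recorded_state = None  # process state recorded at the selected point
        self.channel_state = {}     # sender -> messages recorded for that channel
        self.open = set()           # incoming channels still being recorded

def snapshot_run(tokens, steps=200, seed=0):
    rng = random.Random(seed)
    procs = {n: Process(n, t) for n, t in tokens.items()}
    names = list(procs)
    chan = {(a, b): deque() for a in names for b in names if a != b}

    def record(p):
        # Record own state, then signal on every outgoing channel before any
        # further regular message is sent (signal sending rule).
        p.recorded_state = p.tokens
        p.open = {a for a in names if a != p.name}
        p.channel_state = {a: [] for a in p.open}
        for b in p.open:
            chan[(p.name, b)].append(SIGNAL)

    def deliver(a, b):
        msg, p = chan[(a, b)].popleft(), procs[b]
        if msg == SIGNAL:
            if p.recorded_state is None:
                record(p)          # record before any later receive on this channel
            p.open.discard(a)      # channel (a, b) is now fully recorded
        else:
            p.tokens += msg
            if a in p.open:        # received after recording, before a's signal
                p.channel_state[a].append(msg)

    for step in range(steps):
        if step == steps // 4:
            record(procs[names[0]])
        a, b = rng.sample(names, 2)
        if rng.random() < 0.5 and procs[a].tokens > 0:
            procs[a].tokens -= 1
            chan[(a, b)].append(1)  # regular message carrying one token
        elif chan[(a, b)]:
            deliver(a, b)
    while any(chan.values()):       # drain so that every signal is delivered
        for (a, b), q in chan.items():
            if q:
                deliver(a, b)
    return procs
```

Since tokens are neither created nor destroyed in this toy computation, the recorded process states plus the tokens recorded on channels should add up to the initial total, even though no process ever observed the whole system at one instant.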

Preface



This monograph presents an entirely new approach to the problem of system
simulation. System simulations are typically carried out in a sequential manner: a
single processor fetches one item from a data structure, carries out one step of
simulation, (possibly) updates the data structure and iterates this process. Such
simulations are practical only when the number of events being simulated is modest.

Recent  advances  in  computer  and  communication  systems  have  resulted  in 
demands  for  new  tools  for  their  analyses.  Mathematical  modelling  techniques  have  so 
far  proved  inadequate  in  dealing  with  these  systems.  Only  simulation  seems  to  be  a 
viable  alternative.  Unfortunately,  simulation  is  proving  to  be  inadequate,  because  of 
the  sheer  magnitude  of  the  problem.  For  instance,  a  telephone  switch  generates 
roughly 100 messages in initiating and completing a local call made by a
subscriber.  Large  telephone  switches  can  handle  around  100  calls  per  second.  Thus 
simulation  of  a  telephone  switch  for  15  minutes  of  real  time  requires  the  simulation  of 
nearly  10  million  messages.  Detailed  simulation  of  a  telephone  switch,  even  for  a  15 
minute  interval,  will  require  several  hours  on  a  very  large  uniprocessor. 

An alternative is to exploit the cost benefits of cheap micro/mini computers and
high bandwidth lines, by partitioning the simulation problem and executing the parts
in parallel. Unfortunately, the typical simulation algorithm does not easily
partition for parallel execution. An entirely new approach to simulation, for
multiprocessors, is required. This monograph presents such an approach.

The  text  is  organized  in  5  chapters.  Chapter  1  motivates  the  need  for 
distributed  simulation;  it  gives  a  quick  survey  of  the  system  simulation  problem, 
sequential  simulation  algorithm  and  its  shortcomings.  The  scope  of  the  monograph 
and  a  history  of  distributed  simulation  are  also  included  in  that  chapter.  Chapter  2 
contains  a  detailed  description  of  the  sequential  simulation  scheme.  It  is  shown  why 
this  scheme  cannot  be  readily  parallelized.  Chapter  3  introduces  the  basic  distributed 
simulation  scheme.  This  scheme  is  shown  to  result  in  deadlock.  Several  different 
approaches  for  deadlock  resolution  are  discussed  in  chapter  4.  Chapter  5  contains  a 
summary  and  assessment  of  the  entire  field. 

We  believe  that  distributed  simulation  offers  the  most  promising  approach  to 
speeding  up  simulation.  The  basic  theory  has  been  developed;  it  remains  to 
experiment  with  various  alternative  heuristics. 

This text is mainly oriented toward (i) machine designers, particularly those
designing multiprocessors for application programs, and (ii) application programmers
and simulation practitioners. The material is largely self-contained. Some
acquaintance with simulation and distributed systems is helpful though not necessary.
Researchers in distributed software design will find this monograph to be useful in
that general area. The reader will come away with an appreciation for (i) the nature
of the simulation problem reduced to its barest minimum and (ii) how to approach a
problem for distributed solution.

I apologize for the lack of concrete empirical results. Some results, dealing
particularly  with  queueing  networks,  exist  but  were  found  to  be  too  problem  specific 
for inclusion in this monograph.

Chapter 1

Introduction 


1.  An  Overview  of  Simulation 


This  chapter  motivates  the  need  for  distributed  simulation.  It  gives  a  quick 
survey  of  the  simulation  problem,  shortcomings  of  sequential  simulation  methods  and 
an  overview  of  distributed  simulation.  The  scope  of  this  monograph  and  the  history 
of  distributed  simulation  are  also  included. 

1.1.  System  Simulation  Problem 

We  consider  the  problem  of  simulating  physical  systems,  also  called  networks, 
which  consist  of  one  or  more  physical  processes.  Each  physical  process  operates 
autonomously  except  to  interact  with  other  physical  processes  in  the  system.  The 
interaction  is  by  messages.  Contents  of  a  message  sent  by  a  (physical)  process 
depend  upon  the  characteristics  of  the  process  (its  initial  state,  its  rules  of  operation) 
and  the  messages  that  the  process  has  received  so  far. 

We  will  describe  the  problem  and  the  terminology  more  precisely  in  the  next 
chapter.  We  note  that  many  real  systems  can  be  modelled  in  terms  of  processes  and 
messages as described above. For example, a computer system is one in which the
CPU, disks, memory and job entry terminals may be thought of as processes; the
CPU may interact with a disk by sending it messages requesting or releasing disk
space; a job entry terminal may interact with the CPU by sending it messages, which
are in fact jobs or tasks to be executed. Detailed examples are given in the next
chapter. 

Typical  steps  in  simulation  consist  of, 

1.  starting  with  a  real  system  and  understanding  its  characteristics, 

2.  building  a  model  from  the  real  system  in  which  aspects  relevant  to 
simulation  are  retained  and  irrelevant  aspects  are  discarded, 

3. constructing a simulation of the model which can be executed on a
computer (simulations other than computer programs will not be
considered here), and

4. analyzing simulation outputs to understand and predict the behavior of
the real system.

In addition, the model and the simulation must be verified and may be refined
during steps (2) and (3), perhaps iteratively, if they do not meet the expectations. In
this monograph, we look at only one step, step (3), of the entire simulation process.
What is typically called a model in step (2) is actually our physical system; we show
how to go from a physical system to a computer program for simulation of this
system, which in this case is distributed and hence may be concurrently executed on
several machines. We will not consider the problem of constructing a physical system
description from the real system, nor do we consider how to analyze simulation
outputs to predict the behavior of the real system. Stated another way, we show how
to construct an asynchronous system (the simulator running on asynchronous
machines) from a synchronous system (the physical system, running in real time).
We will further restrict ourselves to discrete event simulations: we assume that events
(in our case, message exchanges) happen at discrete points in time.

Traditional  Approach  to  System  Simulation 

Traditionally,  discrete-event  system  simulations  are  done  in  a  sequential 
manner.  A  variable  clock  holds  the  time  up  to  which  the  physical  system  has  been 
simulated.  A  data  structure,  called  the  event-list,  maintains  some  message 
transmissions  with  their  associated  times  of  transmissions,  which  are  scheduled  in  the 
future. Each of these messages is guaranteed to be sent at the associated time in the
physical  system,  provided  the  sender  receives  no  message  before  this  message 
transmission  time.  At  each  step,  the  message  with  the  smallest  associated  future  time 
is  removed  from  the  event-list,  and  the  transmission  of  the  corresponding  message  in 
the  physical  system  is  simulated.  Sending  this  message  may,  in  turn,  cause  other 
messages  to  be  sent  in  the  future  (which  then  are  added  to  the  event-list)  or  cause 
previously scheduled messages to be cancelled (which are removed from the
event-list).  The  clock  is  advanced  to  the  time  of  the  message  transmission  that  was 
just  simulated. 
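
The event-list loop described above can be sketched in a few lines. This is a minimal illustration, not the algorithm as proved in Chapter 2: the names `run_event_list` and `handler` are hypothetical, the physical system is abstracted into a callback that returns the future messages caused by a transmission, and cancellation of previously scheduled messages is omitted for brevity.

```python
import heapq
import itertools


def run_event_list(initial_events, handler, end_time):
    """Simulate message transmissions in chronological order.

    initial_events: iterable of (time, message) pairs scheduled at the start.
    handler(clock, message): returns a list of (time, message) pairs that the
        simulated transmission causes to be scheduled in the future.
    end_time: transmissions scheduled after this time are not simulated.
    """
    tie_breaker = itertools.count()  # keeps heap entries comparable
    event_list = [(t, next(tie_breaker), m) for t, m in initial_events]
    heapq.heapify(event_list)
    clock = 0
    trace = []
    while event_list and event_list[0][0] <= end_time:
        # Remove the message with the smallest associated time.
        time, _, message = heapq.heappop(event_list)
        clock = time  # advance the clock to the transmission just simulated
        trace.append((clock, message))
        for new_time, new_message in handler(clock, message):
            if new_time < clock:
                raise ValueError("cannot schedule a message in the past")
            heapq.heappush(event_list, (new_time, next(tie_breaker), new_message))
    return trace


if __name__ == "__main__":
    # A toy system in which every transmission causes another one 5 units later.
    print(run_event_list([(0, "tick")], lambda clock, msg: [(clock + 5, "tick")], 12))
```

Only one entry is removed per iteration, which is the sequentiality discussed in the following paragraphs.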

This form of simulation is called event driven, because events (i.e. message
transmissions) in the physical system are simulated chronologically and the simulation
clock is advanced after simulation of an event to the time of the next event. There is
another important simulation scheme, time driven simulation, in which the
clock advances by one tick in every step and all events scheduled at that time are
simulated. We will not discuss time driven simulation in this monograph. We will,
furthermore, assume that all events are discrete, which is certainly true for any
system which can be modelled as a message passing system.

Drawbacks  of  Sequential  Simulation 

The  nature  of  the  event-list  mechanism  dictates  a  sequential  simulation,  since  in 
each  cycle  of  simulation,  only  one  item  is  removed  from  the  event-list,  its  effects 
simulated  and  the  event-list,  possibly,  updated.  This  is  unfortunate,  because  this 
algorithm  cannot  be  readily  adapted  to  concurrent  execution  on  a  number  of 
processors,  since  the  event-list  cannot  be  effectively  partitioned  for  such  executions. 
We contend that a major bottleneck to the growth of widespread simulation is the
sequentiality inherent in the event-list structure. Increasingly complex computer and
communication systems of the future will be intractable mathematically and therefore
their performance evaluations will have to resort to simulation. Current
simulation techniques will prove to be inadequate for these systems because with
current technology only a modest number of events can be simulated. A radically
new approach to simulation must be taken which will utilize the power and cost
benefits of small computers and high bandwidth communication lines.

1.2.  Distributed  Simulation 

This monograph presents a radically different approach to simulation. Shared
data  objects  such  as  the  clock  and  the  event-list  are  discarded.  In  fact,  there  are  no 
shared  variables  in  our  algorithm.  We  suggest  an  algorithm  in  which  one  machine 
may  simulate  a  single  physical  process;  messages  in  the  physical  system  are  simulated 
by message transmissions among the machines. The synchronous nature of the
physical  system  is  captured  by  encoding  time  as  part  of  each  message  transmitted 
between  machines.  We  show  that  machines  may  operate  concurrently  as  long  as 
their  physical  counterparts  operate  autonomously;  they  must  wait  for 
message  receptions  to  simulate  interactions  of  the  corresponding  physical  processes. 
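
To make the idea of encoding time in each message concrete, the following sketch shows one machine that holds no shared clock or event-list. The waiting rule used here (consume the earliest message only when every input line has at least one message pending) is an assumption made for illustration; the precise rule is developed in Chapter 3. The class and method names are hypothetical.

```python
from collections import deque


class Machine:
    """One simulator machine standing in for one physical process."""

    def __init__(self, name, input_names):
        self.name = name
        # One FIFO line per sender; each entry is (time, content).
        self.inputs = {sender: deque() for sender in input_names}
        self.local_time = 0  # time up to which this machine has simulated

    def receive(self, sender, time, content):
        """Accept a time-stamped message from another machine."""
        self.inputs[sender].append((time, content))

    def next_message(self):
        """Return (time, sender, content), or None if the machine must wait.

        Assumed rule: the machine waits unless every input line holds a
        message, then consumes the message with the smallest time stamp.
        """
        if any(not line for line in self.inputs.values()):
            return None
        sender = min(self.inputs, key=lambda s: self.inputs[s][0][0])
        time, content = self.inputs[sender].popleft()
        self.local_time = time
        return (time, sender, content)


if __name__ == "__main__":
    merge = Machine("merge", ["top", "bottom"])
    merge.receive("top", 5, "a")
    print(merge.next_message())  # must wait: nothing yet from "bottom"
    merge.receive("bottom", 3, "b")
    print(merge.next_message())
```

Under this assumed rule a machine can stall whenever one of its input lines stays empty, which hints at why the basic scheme may deadlock.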

Distributed  simulation  offers  many  other  advantages  in  addition  to  possible 
speed-up  of  the  entire  simulation  process.  It  requires  little  additional  memory 
compared  to  sequential  simulation.  There  is  little  global  control  exercised  by  any 
machine.  Simulation  of  a  system  can  be  adapted  to  the  structure  of  the  available 
hardware;  for  instance  if  only  a  few  machines  are  available  for  simulation,  several 
physical  processes  may  be  simulated  (sequentially)  on  one  machine. 

Several  distributed  simulation  algorithms  have  appeared  in  the  literature.  They 
all  employ  the  same  basic  mechanism  of  encoding  physical  time  as  part  of  each 
message. The basic scheme they use may cause deadlock. Different distributed
simulation  algorithms  differ  in  the  way  they  resolve  the  deadlock  issue.  Several  new 
algorithms  for  distributed  deadlock  and  termination  detection  have  been  discovered 
in  the  last  few  years.  Combining  these  algorithms  with  the  basic  distributed 
simulation  mechanism  is  expected  to  result  in  very  efficient  and  practical  simulation 
schemes.  Empirical  investigations  are  currently  under  way  to  assess  the  performance 
of  different  schemes. 

1.3.  Scope  Of  This  Monograph 

This monograph is a comprehensive survey of all known (to the author)
distributed simulation schemes. In order to make the monograph self-contained,
basic notions of sequential simulation are introduced and explained in Chapter 2. A
proof that the sequential simulation algorithm works correctly is given in that
chapter; surprisingly, the author could not find such a proof in any simulation book.
Chapter 3 introduces a basic distributed simulation scheme and shows its partial
correctness. It is shown why the basic scheme may be inadequate, i.e., may result in
deadlock. Different deadlock resolution schemes proposed in the literature are
presented in Chapter 4. A summary of current results and possible directions for
future investigations are outlined in Chapter 5.

This  monograph  does  not  introduce  a  new  simulation  language,  because 
distributed  simulations  can  be  written  using  sequential  simulation  languages  for 
simulating the physical processes and message communication languages for
describing interactions among component machines. We also avoid a number of
traditional issues in simulation: pseudo-random number generation, statistical
analysis of the outputs, etc. Methods developed in these areas for sequential
simulation,  see  |l  l),  still  apply.  Our  goal  in  this  monograph  is  to  show  how  the  body 
of  actual  simulation  can  be  distributed  among  a  set  of  interacting  machines. 

1.4.  History 

Sequential simulation has a long history; Franta [15] provides a discussion of a
number of prominent simulation languages and their relative merits. Among the
many simulation packages introduced recently, we mention DEMOS [!>], SAMOA [21]
and MAY [2]. DEMOS is a discrete event modelling package implemented in
SIMULA [13]. It provides an extensive list of features for event scheduling, data
collection and report generation. SAMOA uses Ada [1] as the base language. MAY is
based on FORTRAN IV and provides the minimum number of constructs needed to
carry out simulation; these features have been used to build an extensive library for
data collection, output analysis and report generation. The minimality of
MAY makes it possible for it to be implemented even on personal computers.

The idea of distributed simulation was first proposed by Chandy in 1977 in a
series of lectures at the University of Waterloo; these ideas were later refined and
published by Chandy and Misra [7] and Chandy, Holmes and Misra [8]. They
observed that the basic scheme of time encoding may lead to deadlock and they
proposed schemes for deadlock avoidance. Independently, R. E. Bryant [6] discovered
the basic simulation scheme. Peacock, Wong and Manning [24,25] and Holmes [17]
proposed mechanisms for avoiding deadlock by periodic use of probe messages.
Empirical work by Peacock, Wong and Manning has shown that the method is
indeed viable: the time needed for simulation of a class of queueing networks steadily
decreases when the number of processors available for simulation increases.
Empirical investigations by Seethalakshmi [28] and Quinlivan [26] showed that the
method is also efficient for acyclic physical systems and that performance can be
substantially improved if there are many buffer spaces between machines for
buffering messages.

Chandy and Misra [9] have subsequently suggested a scheme for deadlock
detection and recovery. Reynolds [27] suggested using common memory among
neighbors to avoid deadlock. A notable departure from these schemes is the one
proposed by Jefferson and Sowizral [18]. They suggest that a machine should wait to
receive messages for a certain period of time; if it receives no messages from some
machine in that period, it assumes that there will be no further messages from that
machine, and it then continues simulation under this assumption. In case a
message is received from some machine which violates this machine's previous
assumption, it rolls back to its previous state and sends "antimessages" cancelling the
messages it may have sent under the false assumption. Empirical investigation of the
behavior of this algorithm is continuing.

Bezivin and Imbert [3] propose an approach similar to Jefferson's. In their
approach, each process in the simulator maintains a local time and an overall global
time is maintained by a central process. Christopher et al. [12] propose
precomputing minimum wait time along all paths in a network so that delay
information may be propagated rapidly among non-neighboring processes. Practical
simulation results, employing many processors, have been reported in [23, 29].

Dev Kumar [20] has combined some recent work in deadlock and termination
detection [22] with the basic simulation scheme. He has developed algorithms which
are more hierarchically structured. His scheme has parameters to control the number
of overhead messages. Behaviors of these algorithms on a wide class of practical
simulation problems are currently being investigated.

Chapter 2

Sequential  Simulations  of  Systems 


2.  Sequential  Simulations  of  Systems 


This  chapter  introduces  the  problem  of  system  simulation.  A  precise  definition 
of  simulation  is  given.  The  sequential  simulation  algorithm  using  the  event  list 
structure is presented and proved. It is shown why the sequential simulation scheme
cannot  be  readily  adapted  for  parallel  execution. 


2.1.  Physical  Systems 

We  consider  physical  systems,  also  called  networks,  consisting  of  a  finite 
number  of  physical  processes  (abbreviated  as  pp’s).  Each  pp  represents  some 
component  of  the  real  system  to  be  simulated.  For  instance,  in  a  computer  system, 
the CPU, each disk, each memory bank and each job entry terminal may be thought
of  as  a  pp.  A  pp  usually  interacts  with  other  pp’s  from  time  to  time.  In  traditional 
simulation  terminology,  events  happen  at  a  pp,  and  the  occurrence  of  an  event  at  one 
pp  may  cause  other  events  to  happen  at  various  other  pp’s.  We  will  use  a  slightly 
different  terminology  in  this  monograph  which  makes  our  description  of  the 
simulation  algorithms  considerably  simpler. 

Events that are local to a pp, i.e., those which do not directly affect the
behaviors of other pp's, may be simulated locally as part of the simulation of
the pp. Any event that causes events to happen at other pp's may be modelled by
transmitting a message whose reception causes the desired event to happen. For
instance, if event e1 at pp 1 causes event e2 to happen at pp 2, we can model these
event dependencies by (1) pp 1 sending a message to pp 2 and (2) pp 2 causing event
e2 to happen locally, at a proper time after the receipt of the message. Event e2 may
cause an event e3 to happen at pp 3, in which case it must also be modelled as a
message transmission between pp 2 and pp 3. An event at a pp which causes events
to happen at several other pp's must be modelled as several message transmissions
among a number of pp's.

We  next  give  an  example  which  clarifies  the  relationship  between  events  and 
messages.  The  reader  is  urged  to  study  this  example  because  it  shows  explicit 
message transmissions between pp's which were not in the original description of the

Example 2-1 (Car Wash)

The following example is a variation of one appearing in [lj. A car wash
system consists of an attendant and two car washes, abbreviated CW1 and CW2.
Cars arrive at random times at the attendant. The attendant directs cars to CW1 or
CW2 according to the following rule: if both car washes are busy, i.e. washing cars,
any arriving car is queued at the attendant; if exactly one car wash is idle, the car at
the head of the queue, if any, is sent to that idle car wash; if both car washes are idle,
the car at the head of the queue, if any, is sent to CW1. CW1 spends 8 minutes and
CW2 10 minutes in washing a car. Given some distribution of car arrivals, it is
required to compute the average amount of time a car spends at the car wash
(including the washing time) and the average length of the queue that builds up at
the attendant. We will not compute the above statistics; we will simply show the
sequence of events and message transmissions in two different views of the car wash
problem.

The  schematic  diagram  of  the  flow  of  cars  is  given  below. 


Figure 2-1: Schematics of Car Flow (cars enter the car wash at the attendant; cars
leave the car wash)

Initially both CW1 and CW2 are idle. Assume that 6 cars, C1 through C6,
arrive at the attendant at times 3, 8, 9, 14, 16, 22. An event in this system is either a car
arriving at some point, i.e., at the attendant, CW1 or CW2, or a car leaving the car
wash. We assume that the driving time from the attendant to CW1 or CW2 is zero.
Also, when a car arrives at CW1 or CW2, it starts getting service immediately. The
chronological sequence of events is given in Table 2-1.

An event e2 depends directly on an event e1 if

1. e1, e2 both happen at the same process and e1 happens before e2, or

2. e1, e2 happen at different processes and e1 is one of the causes of e2, i.e., if
e1 had not happened, e2 would not have happened.

Event e is dependent on event e' if e depends directly on e', or e depends on
e'', which depends directly on e'. Two events are independent if they are not
dependent on each other. It follows then that independent events can be simulated in
parallel, while dependent events must be simulated in sequence.


Event Number   Time   Event

     1           3    C1 arrives at the attendant
     2           3    C1 arrives at CW1
     3           8    C2 arrives at the attendant
     4           8    C2 arrives at CW2
     5           9    C3 arrives at the attendant
     6          11    C1 leaves car wash
     7          11    C3 arrives at CW1
     8          14    C4 arrives at the attendant
     9          16    C5 arrives at the attendant
    10          18    C2 leaves car wash
    11          18    C4 arrives at CW2
    12          19    C3 leaves car wash
    13          19    C5 arrives at CW1
    14          22    C6 arrives at the attendant
    15          27    C5 leaves car wash
    16          27    C6 arrives at CW1
    17          28    C4 leaves car wash
    18          35    C6 leaves car wash

Table 2-1: A Sequence of Events in the Car Wash
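
The times in Table 2-1 can be checked mechanically. The sketch below is not the event-list algorithm of this chapter; it is a shortcut that applies the attendant's rule car by car, assuming first-come-first-served queueing, zero driving time, and the arrival and washing times of Example 2-1. The function name is hypothetical.

```python
def simulate_car_wash(arrivals, wash_minutes=(8, 10)):
    """Return a list of (car, car_wash, start_time, leave_time).

    arrivals: arrival times at the attendant, in increasing order.
    wash_minutes: washing times of CW1 and CW2.
    """
    free_at = [0, 0]  # time at which CW1 and CW2 next become idle
    schedule = []
    for number, arrival in enumerate(arrivals, start=1):
        # The car at the head of the queue starts as soon as it has arrived
        # and at least one car wash is idle.
        start = max(arrival, min(free_at))
        # CW1 is preferred whenever it is idle; otherwise CW2 must be idle.
        washer = 0 if free_at[0] <= start else 1
        leave = start + wash_minutes[washer]
        free_at[washer] = leave
        schedule.append((f"C{number}", f"CW{washer + 1}", start, leave))
    return schedule


if __name__ == "__main__":
    for row in simulate_car_wash([3, 8, 9, 14, 16, 22]):
        print(row)
```

With the arrival times 3, 8, 9, 14, 16, 22 the cars leave at times 11, 18, 19, 28, 27 and 35 respectively, in agreement with the table.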

Dependencies among events are shown in the directed graph of Figure 2-2; an
edge from event e1 to event e2 denotes that event e2 depends directly on event e1;
therefore e1 must be simulated before e2.


Figure 2-2: Schematics of Events in a Car Wash

Two events, such as event 8 (C4 arrives at the attendant) and
event 12 (C3 leaves car wash), are independent and hence can be simulated
simultaneously.
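
Independence can be tested by following direct dependencies backwards. The sketch below uses the event numbers of Table 2-1. Because the graph of Figure 2-2 is not reproduced here, the edges are an assumption: events are ordered within each process (taking a "leaves car wash" event to occur at the car wash concerned), and each arrival at a car wash is caused by the same car's arrival at the attendant. All names are hypothetical.

```python
def build_direct(chains, cross_edges):
    """Map each event to the set of events it depends on directly."""
    direct = {}
    for chain in chains:  # events of one process, in chronological order
        for earlier, later in zip(chain, chain[1:]):
            direct.setdefault(later, set()).add(earlier)
    for cause, effect in cross_edges:  # dependencies between processes
        direct.setdefault(effect, set()).add(cause)
    return direct


def ancestors(event, direct):
    """All events on which `event` depends, directly or indirectly."""
    seen = set()
    stack = [event]
    while stack:
        for parent in direct.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def independent(a, b, direct):
    """True if neither event depends on the other."""
    return a not in ancestors(b, direct) and b not in ancestors(a, direct)


# Assumed edges for the car wash (not taken from Figure 2-2).
ATTENDANT = [1, 3, 5, 8, 9, 14]
CW1 = [2, 6, 7, 12, 13, 15, 16, 18]
CW2 = [4, 10, 11, 17]
CROSS = [(1, 2), (3, 4), (5, 7), (8, 11), (9, 13), (14, 16)]
DIRECT = build_direct([ATTENDANT, CW1, CW2], CROSS)

if __name__ == "__main__":
    print(independent(8, 12, DIRECT))  # events 8 and 12 of Table 2-1
    print(independent(5, 7, DIRECT))   # event 7 depends on event 5
```

Under these assumed edges, events 8 and 12 come out independent, while events 5 and 7 do not.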

We now present the car wash viewed as a message passing system. The car
wash system has 5 processes: the source, which generates cars at the prescribed
times, the attendant, CW1, CW2 and the sink (exit). The schematic diagram of
message communications among these processes is given in Figure 2-3.


Figure  2-3:  Schematics  of  Message-Flow  in  the  Car  Wash  System 

Note that we have possible message flow paths from CW1 and CW2 to the
attendant. This is because the attendant must know when a car wash becomes idle.
(In this particular problem, the attendant can keep track of the times at which he
sent the last cars to CW1 and CW2, and since the washing times are fixed, he can
deduce the times at which CW1 and CW2 will next become idle. This means that the
attendant is simulating CW1 and CW2. In general it will be neither possible nor
preferable to do so.) The attendant expects messages from CW1 and CW2 each time
they become idle. A complete list of messages for this example is shown in Table 2-2
with corresponding event numbers from Table 2-1. Each message has a sender, a
receiver and message content. In our case, the content is either a car number or the
status (idle) of a car wash.

This  example  shows  how  to  model  event  interactions  by  message  transmissions. 
In  particular,  if  an  event  at  one  pp  causes  events  to  happen  at  several  other  pp’s,  we 
will  have  to  model  such  event  dependencies  by  several  message  transmissions. 
Secondly,  the  chronological  order  of  simulations  of  events  in  sequential  simulation 
(described  later)  guarantees  that  every  event  simulation  precedes  the  simulation  of 
events that depend upon it. Our approach in distributed simulation will dispense
with chronological simulations of events.

Message   Event    Time   Sender      Receiver    Message
Number    Number                                  Content

   1        -        0    CW1         attendant   idle
   2        -        0    CW2         attendant   idle
   3        1        3    source      attendant   C1
   4        2        3    attendant   CW1         C1
   5        3        8    source      attendant   C2
   6        4        8    attendant   CW2         C2
   7        5        9    source      attendant   C3
   8        6       11    CW1         sink        C1
   9        -       11    CW1         attendant   idle
  10        7       11    attendant   CW1         C3
  11        8       14    source      attendant   C4
  12        9       16    source      attendant   C5
  13       10       18    CW2         sink        C2
  14        -       18    CW2         attendant   idle
  15       11       18    attendant   CW2         C4
  16       12       19    CW1         sink        C3
  17        -       19    CW1         attendant   idle
  18       13       19    attendant   CW1         C5
  19       14       22    source      attendant   C6
  20       15       27    CW1         sink        C5
  21        -       27    CW1         attendant   idle
  22       16       27    attendant   CW1         C6
  23       17       28    CW2         sink        C4
  24        -       28    CW2         attendant   idle
  25       18       35    CW1         sink        C6
  26        -       35    CW1         attendant   idle

Table 2-2: A Sequence of Message Transmissions in the Car Wash System

In summary, a pp may send messages to and receive messages from other pp's
at discrete times. Message transmission delays are zero, i.e., any message sent at time
t is received by the intended recipient at t. (Recall that we are describing a physical
system, not the computer system on which the simulation is to run.) If it is necessary
to model delays in the real world system (viz. driving time from attendant to a car
wash in the last example), then either the sender of a message idles for some time
before sending the message or the recipient of a message idles for some time after
receiving the message; another possibility is to model the communication medium as
a process, incorporating the delay.

There are two conditions which are met by every physical system imaginable:
realizability and predictability, which are described next. We will assume that both
these conditions hold for all physical systems we consider.

Realizability 

A message sent by a pp at time t may depend only upon the messages it has
received  up  to  and  including  t.  Realizability  says  merely  that  a  pp  cannot  guess  any 
message  it  will  receive  in  the  future. 

Note  that  we  admit  the  possibility  of  a  message  that  is  received  at  t,  affecting  a 
message  that  is  sent  at  t.  An  example  of  a  pp  in  which  this  instantaneous  cause-effect 
is  seen  is  given  below. 

Example  2-2  (Instantaneous  Message  Transmission) 

Consider  a  pp  which  acts  as  a  merge  point  for  several  pp’s.  Schematically,  such 
a pp, A, is shown in Figure 2-4.


Figure 2-4: A Merge Point pp

Messages arriving at A, either from the top or from the bottom, are instantaneously
sent  to  the  queue  on  the  right.  Therefore  a  message  sent  by  A  at  t  depends  upon 
messages  received  at  t.  It  may  be  argued  that  pp  A  cannot  be  physically  constructed. 
However  A  might  represent  a  real  world  entity  where  the  interval  between  reception 
and  transmission  of  a  message  is  small  enough  to  be  ignored  altogether  in  the 
modelling  process;  in  fact  pp  A  may  not  even  exist  in  the  real  world  system  and  is 
created,  during  modelling,  to  simplify  description  of  the  real  system.  Such  merge 
points are often used in queueing network descriptions of systems.

Predictability

Suppose the physical system has cycles, i.e. it has a set of processes pp_0, ..., pp_{n-1}
where pp_i sends messages to pp_{i+1} (and perhaps to other pp's) and receives messages
from pp_{i-1} (and perhaps other pp's) (footnote 1). Suppose that the message, if any,
sent by pp_i at some time t depends upon what pp_i receives at t, for all i; then we have
a circular definition where the message received by every pp at t is a function of itself.
Such definitions lead to "race conditions" in physical realizations. In order to avoid such
situations, we require that in every cycle and for every t, there is a pp whose outputs
(messages it sends) along the edge of the cycle can be determined beyond t, up to
t+ε for some fixed ε, ε>0, given the set of input messages to it up to t.

We  next  consider  some  typical  simulation  examples  and  show  that  they  satisfy 
the  realizability  and  predictability  conditions. 

Example  2-3  (Car  Wash  -  Realizability  and  Predictability) 

We  consider  the  car  wash  problem  introduced  in  Example  2-1.  Each  pp’s 
output  at  time  t  depends  only  upon  the  messages  it  has  received  up  to  t.  Of 
particular interest is the behavior of the attendant. If it receives an "idle" message
from either of the car washes at time t, and the queue is not empty at t, then it sends
a message at t. Therefore the realizability condition is satisfied. The
predictability condition is satisfied because each cycle contains one of CW1 or CW2,
and given the input to CW1 (CW2) up to t, we can predict the output from it up to
t+8 (t+10).

Example  2-4  (Assembly  Line) 

An  assembly  line  consists  of  a  series  of  n  work  stations.  Jobs  enter  the 
assembly  line  at  work  station  1;  when  a  job  has  been  serviced  at  work  station  i  it 
proceeds to work station (i+1), i=1,2,...,n-1; a job leaves the system after being
serviced at work station n. Service times at different work stations are random
variables. There are queues at stations where the jobs waiting to be serviced by a
station may be queued. A work station takes one job from its input queue when it is
free, services that job and then sends it to the queue of the following work station.
All work stations service the jobs on a First-Come-First-Served (FCFS) basis. It is
desired to find the expected number of jobs in the queue of each work station and the
expected waiting time for jobs at each work station.

Specifically,  consider  an  assembly  line  consisting  of  3  work  stations,  A,  B  and 
C, which services 4 jobs identified as 1, 2, 3, 4. Schematically, the assembly line is
shown  below. 

Figure 2-5: Schematic Diagram of the Example Assembly Line

The times at which the source generates jobs and the service time of each work
station for each job are given in Table 2-3.

Table 2-3: Job Generation Times and Servicing Times

Footnote 1: All arithmetic in pp subscripts is modulo n.


The source (call it work station 0), the sink (call it work station 4) and each
work station is a pp. pp i sends messages to pp (i+1), i=0,1,...,3. The source sends
messages (which represent jobs) to work station 1 at times 5, 7, 30 and 32. If a job j,
j ≥ 1, arrives at a work station at time t, then its service at this work station begins
either immediately (at t) if the work station is then idle, or immediately after
the departure of the (j-1)st job from the work station. Let A_j be the time of arrival
of job j at some work station, let D_j be the departure time of job j from this work
station, and let S_j be the service time required for job j at this work station. Then
we have,


    D_0 = 0

    D_j = max(A_j, D_{j-1}) + S_j,   j = 1, 2, ...

Using the service times and generation times of jobs given in the previous table, we
can construct the departure times from work stations, i.e., times at which messages
are sent, as in the following table.


                 Job 1   Job 2   Job 3   Job 4

    Source         5       7      30      32
    A              9      19      31      37
    B             21      36      38      45
    C             23      39      40      49

Table 2-4: Times at Which pp’s Send Messages

Each work station’s output at time t depends only upon the jobs it has received
up to t, and therefore the realizability condition is satisfied. The
predictability condition is trivially satisfied since there is no cycle in the physical
system.
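The construction of Table 2-4 can be checked mechanically. The following is a minimal sketch in Python, not part of the original notation: it applies the recurrence D_0 = 0, D_j = max(A_j, D_{j-1}) + S_j station by station, feeding each station's departure times to the next station as arrival times. The function and variable names are illustrative assumptions; the service times are those of Table 2-3 as reproduced in Example 3-2.

```python
def departure_times(arrivals, services):
    """Apply D_0 = 0, D_j = max(A_j, D_{j-1}) + S_j for one FCFS work station."""
    departures = []
    previous = 0  # D_0
    for a, s in zip(arrivals, services):
        previous = max(a, previous) + s
        departures.append(previous)
    return departures


# Job generation times and service times (Table 2-3).
generation = [5, 7, 30, 32]
service = {"A": [4, 10, 1, 5], "B": [12, 15, 2, 7], "C": [2, 3, 1, 4]}

# The departures of one station are the arrivals of the next.
send_times = {"Source": generation}
arrivals = generation
for station in ("A", "B", "C"):
    arrivals = departure_times(arrivals, service[station])
    send_times[station] = arrivals

for name, times in send_times.items():
    print(name, times)
```

Each row printed by the sketch should agree with the corresponding row of Table 2-4.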

Example  2-5  (A  Computer  Network) 

Imagine a computer installation that consists of a CPU and 2 peripheral
processors, proc1 and proc2. Jobs enter the CPU, spend some time there and then
branch to one of the peripheral processors with some given probability. Upon
completion of processing at the peripheral processor, the job may leave the system or
return to the CPU with some probability. The schematic diagram of the system is
shown in Figure 2-6.

Figure 2-6: Schematic Diagram of Job Flow in a Computer System Which
Has a CPU and Two Peripheral Processors

           mean time between arrival of jobs from the outside source, a
           random variable

           mean time spent by a job at the CPU, a random variable

t1:        mean time spent by a job at the peripheral processor 1 (proc1), a
           random variable

t2:        mean time spent by a job at the peripheral processor 2 (proc2), a
           random variable

α:         probability of a job going to proc1

β:         probability of a job exiting the system

M1, M2:    merge points

B1, B2:    branch points

This system has pp’s for the source, the sink, merge points M1 and M2, branch
points B1 and B2, the CPU, proc1, and proc2. Each message represents the transfer
of a job from one pp to another. The realizability property holds, because no pp
bases its behavior on anticipation of the future. Probabilistic decisions at B1, B2
cause no difficulty because the inputs to these pp’s up to time t determine their outputs
up to time t (though the outputs may be different at different times due to the
probabilistic nature). We can realistically assume that each processor spends nonzero
time in processing a job. Therefore the system also has the
predictability property.

This concludes our discussion of modelling real world systems by physical
systems, i.e., systems consisting of message passing processes and operating in real
time. From now on, we will assume that this modelling has been performed and that
we are dealing with physical systems with realizability and predictability properties.
Now we define the meaning of simulation, precisely, for such physical systems.


2.2.  What  is  Simulation 

We wish to build a simulator or a logical system, consisting of logical
processes (abbreviated lp), to simulate a physical system. We will use "simulation"
in a rather strict sense: we say that a logical system correctly simulates a physical
system if it is possible for the logical system to predict the exact sequence of message
transmissions in the physical system. That is, if t_1, t_2, ..., t_i, ... are the times at which
the messages m_1, m_2, ..., m_i, ... are transmitted in the physical system and
t_1 < t_2 < ... < t_i < ..., then the logical system should be able to output the sequence
<(t_1,m_1), (t_2,m_2), ..., (t_i,m_i), ...>.

We observe the following facts from the definition of simulation just stated.

1. The logical system must be able to determine the exact chronological
   sequence in which message transmissions take place in the physical
   system: therefore it is not acceptable to predict (t,m) and then (t’,m’),
   where t’<t.

2. The logical system may not actually print the sequence <...(t_i,m_i)...>.
   All that is desired is that it should be possible to do so from the logical
   system.

Clearly a physical system is a simulation of itself. We wish to construct logical
systems which may not operate at the same speed as the physical system. Our goal is
to construct a logical system out of a machine or machines where the speeds of
processors and communication links (if any) are arbitrary. In other words, we wish to
duplicate the behavior of a synchronous physical system using asynchronous logical
components.

It should be observed that we can do the typical functions of a simulation
(analyze data, predict performance or future behavior, generate reports, etc.) by
using the logical system. We do not address these issues in this monograph; we
merely observe that since it is possible to create the sequence of physical message
transmissions in the logical system, all interactions can be reconstructed and
analyzed.

Example 2-6 (Message Transmission in the Assembly Line
Example)

A simulation of the assembly line of Example 2-4 should be able to predict the
following message sequence. This sequence is derived from Table 2-4. In the
following, a message consists of (sender id, receiver id, message content). We will
write a 4-tuple (t,s,r,m) to denote that at time t, process s sends a message to r with
content m.

<(5,source,A,1), (7,source,A,2), (9,A,B,1), (19,A,B,2), (21,B,C,1), (23,C,sink,1),
(30,source,A,3), (31,A,B,3), (32,source,A,4), (36,B,C,2), (37,A,B,4), (38,B,C,3),
(39,C,sink,2), (40,C,sink,3), (45,B,C,4), (49,C,sink,4)>
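As a small illustration of how this sequence follows from Table 2-4, the sketch below (Python; the dictionary names are illustrative assumptions) turns each table entry into a 4-tuple (t,s,r,m) and sorts the tuples by time. Sorting by t alone is enough here because all sixteen send times in Table 2-4 are distinct.

```python
# Send times from Table 2-4; the k-th entry of each list is the send time of job k.
send_times = {
    "source": [5, 7, 30, 32],
    "A": [9, 19, 31, 37],
    "B": [21, 36, 38, 45],
    "C": [23, 39, 40, 49],
}
# Each pp sends to the next pp in the line.
receiver = {"source": "A", "A": "B", "B": "C", "C": "sink"}

messages = sorted(
    (t, sender, receiver[sender], job)
    for sender, times in send_times.items()
    for job, t in enumerate(times, start=1)
)

for message in messages:
    print(message)
```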


2.3.  The  Sequential  Simulation  Algorithm 

Two major data objects used by the sequential simulation algorithm are
clock and event-list. Their meanings are given below.

clock:        is a real-valued variable. It gives the time up to which the
              corresponding physical system has been simulated.

event-list:   is a set of tuples of the form (t_i,m_i), where t_i is some time,
              t_i ≥ clock, and m_i is a message. (We assume that the identities of
              the sender and the receiver are parts of the message.) A (t_i,m_i) is in
              the event-list means: if the sender of m_i receives no further
              messages between the current time (given by clock) and t_i, then it
              sends no other message between clock and t_i, and sends m_i at t_i.

It is required that for every pp_j there must be exactly one event-list entry (t_i,m_i) in
which pp_j is the sender. If a pp sends no message in the future unless it receives
further messages, the corresponding event-list entry will be (∞,m), where the message
content in m is arbitrary. A similar entry, (∞,m), will always be in the event-list for
a pp that has terminated.

Example  2-7  (A  Snapshot  in  Sequential  Simulation  of  the 
Assembly  Line) 

In simulating the assembly line of Example 2-4, a possible value of clock and
corresponding entries in the event-list are shown below.

clock        9

event-list   {(19,A,B,2), (21,B,C,1), (∞,C,sink,-), (30,source,A,3)}

This snapshot of the simulation corresponds to the point in the physical system where
the source has produced jobs 1 and 2 and job 1 has been processed at A and sent to
B. The source has one more job scheduled for production; A has scheduled to send
job 2 to B at time 19, provided A receives no more jobs between 9 and 19; B has
scheduled to send job 1 to C at time 21 provided it receives no more jobs before then;
C has scheduled no message because it has received no jobs.

It should be noted that each entry (t,m) in the event-list is conditional: (t,m)
may not actually occur in the physical system because this message transmission may
be cancelled if the sender of m receives a message prior to t. In fact, one can
construct an example where nearly all entries in the event-list are cancelled in each
cycle. There is only one entry (t,m), where t is the smallest time component of any
entry in the event-list, which is guaranteed to occur. (We assume, for the moment,
that there is a unique entry with the smallest value of t-component. The case of
multiple entries with smallest t-components is discussed below.) The sequential
simulation algorithm is based on this fact. We state and prove this result, somewhat
rigorously, below.

Simulations  of  Simultaneous  Events 

Two message transmissions that happen simultaneously in the physical system,
i.e., at the same time t, require that a sequential simulation simulate them in some
order. This can lead to problems, as shown in the following example: pp A plans to
send a message m to pp B at time t; pp B is an alarm clock that is scheduled to go
off, i.e., send a message m’ to pp A, at time t, unless it receives a message from pp A
before or at t. In the physical system, pp B will not send m’ to pp A. However, if
these message transmissions are simulated sequentially in arbitrary order, a possible
simulation may result in pp B sending m’ to pp A. As the reader will see later, this
problem does not arise in distributed simulation. However, sequential
simulation requires an ordering of simultaneous events and different orderings may
lead to different simulations. Certain sequential simulation languages, such as
GPSS [15], provide the user with facilities for defining these orderings.

In proving Theorem 1 below, which forms the basis for sequential simulation,
we will skirt the issue: we will assume that simultaneous events are independent,
i.e., if (t,m) and (t,m’) are both in the event-list and t is the smallest time component,
then both these message transmissions will take place in the physical system.
Therefore these message transmissions may be simulated in either order.

The  Simulation  Algorithm 

The  theorem  stated  below  forms  the  basis  of  the  sequential  simulation 
algorithm. 

Theorem 1: Let (t,m) be an entry in the event-list such that t < t’
for every other entry (t’,m’) in the event-list. Then the message m is
transmitted at time t in the physical system and no message is transmitted
between the current clock value and t.

Proof: If message m is not transmitted at t, it must be because some
other message is transmitted earlier than t (and after clock value) which
causes the sender of m to cancel transmission of m. Consider the first
message m’ to be transmitted after the current clock value; it must be
transmitted at t’ where clock < t’ < t. The sender of m’ could not have
received any message between clock and t’, because such a message would
be the first message. (t’,m’) must be an entry in the event-list, because the
sender of m’ sends its message without receiving any other message after
the current clock value and before t’. But t’ < t contradicts our choice of
(t,m). Hence the theorem.

The simulation algorithm, given below, works as follows. In each step, the
message with the smallest associated time is removed from the event-list, its effects
are simulated causing possible additions to and deletions from the event-list, and the
clock is advanced to the time associated with this message transmission. This
algorithm is given in a pseudo-programming notation below.

Algorithm  for  Sequential  System  Simulation 


Initialize::

    clock := 0;
    event-list := {(t_i,m_i) | message m_i will be sent at t_i, unless
                               the sender of m_i receives a message
                               before t_i; one such entry exists for
                               each pp as the sender};

Iterate::

    while termination criterion is not met do

        remove any (t,m) from the event-list where t is the smallest
        time component;

        simulate the effect of transmitting m at time t;

            {This may cause changes in the event-list.
             Note however that any addition (t’,m’)
             to the event-list will have t’ ≥ t and
             any deletion (t’,m’) will have t’ > t}

        clock := t

    endwhile

The correctness of this algorithm should be obvious from our previous
discussions. Note that the sequential simulation algorithm is capable of producing
the sequence of message transmissions in the physical system; it simply prints (t,m)
when it removes (t,m) from the event-list. Furthermore, this algorithm cannot, in
general, choose to simulate more than one tuple in any step, because, as we have
noted earlier, none of the tuples except the one chosen by the algorithm may occur in
the physical system.
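The loop above can be made concrete for the assembly line example. The following is a minimal sketch in Python under stated assumptions, not the monograph's own program: the event-list is held as a dictionary with exactly one entry per sending pp, the hypothetical constant IDLE stands for an (∞,m) entry, and the termination criterion is taken to be "every entry is at ∞". In this particular example an arriving job never cancels a scheduled departure, so the sketch only ever replaces the entry of a pp that has just sent or that was idle.

```python
import math

GENERATION = [5, 7, 30, 32]                      # job generation times (Table 2-3)
SERVICE = {"A": [4, 10, 1, 5], "B": [12, 15, 2, 7], "C": [2, 3, 1, 4]}
NEXT = {"source": "A", "A": "B", "B": "C", "C": "sink"}
IDLE = (math.inf, None)  # (time, job): sends nothing unless it receives a message


def simulate():
    clock = 0
    queues = {station: [] for station in SERVICE}  # jobs waiting at each station
    # Exactly one event-list entry per sending pp.
    event_list = {"source": (GENERATION[0], 1), "A": IDLE, "B": IDLE, "C": IDLE}
    trace = []
    while True:
        # Remove the entry with the smallest time component.
        sender = min(event_list, key=lambda s: event_list[s][0])
        t, job = event_list[sender]
        if t == math.inf:          # assumed termination criterion
            break
        target = NEXT[sender]
        trace.append((t, sender, target, job))
        # Effect on the sender: schedule its next transmission, if any.
        if sender == "source":
            more = job < len(GENERATION)
            event_list[sender] = (GENERATION[job], job + 1) if more else IDLE
        elif queues[sender]:
            nxt = queues[sender].pop(0)
            event_list[sender] = (t + SERVICE[sender][nxt - 1], nxt)
        else:
            event_list[sender] = IDLE
        # Effect on the receiver: start service if idle, otherwise queue the job.
        if target != "sink":
            if event_list[target] == IDLE:
                event_list[target] = (t + SERVICE[target][job - 1], job)
            else:
                queues[target].append(job)
        clock = t                  # advance the clock to the simulated time
    return trace


trace = simulate()
for entry in trace:
    print(entry)
```

The tuples printed are the message transmissions in the order in which they are removed from the event-list, which is the chronological order required by the definition of simulation.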


Example 2-8 (A Sequence of Snapshots in the Simulation of the
Assembly Line)

We consider the assembly line example and show a partial sequence of
event-lists and clock values.

clock    event-list               message with smallest
                                  associated time

0        <(5,source,A,1),         (5,source,A,1)
          (∞,A,-,-),
          (∞,B,-,-),
          (∞,C,-,-)>

5        <(7,source,A,2),         (7,source,A,2)
          (9,A,B,1),
          (∞,B,-,-),
          (∞,C,-,-)>

7        <(30,source,A,3),        (9,A,B,1)
          (9,A,B,1),
          (∞,B,-,-),
          (∞,C,-,-)>

9        <(30,source,A,3),        (19,A,B,2)
          (19,A,B,2),
          (21,B,C,1),
          (∞,C,-,-)>

Notes  on  Parallel  Execution 

It should be obvious that, in general, we cannot do much better than processing
one tuple from the event-list at a time. In order to process more than one tuple, say
both (t,m) and (t’,m’) at once, we must be sure that these two events are
independent, i.e., that execution of one will not in any way affect the execution of the
other. This requires us to know more about the cause-effect relationship among
messages. We consider these issues in the next chapter in developing a basic scheme
for distributed simulation, the subject with which this monograph is concerned.


Chapter  3 

Distributed  Simulation:  The  Basic  Scheme 




In  this  chapter,  we  introduce  a  model  of  distributed  computation  and  we  show 
how  a  simulation  may  be  carried  out  by  a  set  of  communicating  processes.  We  limit 
our  discussion  here  to  a  basic  scheme,  one  which  can  result  in  deadlock.  More 
sophisticated  schemes  which  resolve  deadlock  are  discussed  in  the  next  chapter. 

3.1.  A  Model  of  Asynchronous  Distributed  Computation 

A distributed system consists of a finite number of processes and directed
edges connecting some pairs of processes. To distinguish these processes from
physical processes, we call them logical processes or lp’s. Each lp is a sequential
process that executes both its sequential code and two special commands: receive and
send. In a send command, an lp names an outgoing edge and a message that is to be
sent along that edge. The execution of the send command results in the message
being deposited on the named outgoing edge; the sender then proceeds with the
execution of its code. Each message takes an arbitrary but finite time to reach its
destination. Messages sent along an edge are delivered in the sequence in which they
are sent. In a receive command, an lp names one or more incoming edges from any
one of which it wishes to receive a message. An lp wishing to receive may have to
wait until a message arrives along one of the edges that it is waiting for. Note that
our communication protocol is extremely simple and can be implemented readily on
many existing machine architectures.

A set of lp’s D is deadlocked at some point in the computation if (1) every lp in
D is either waiting to receive or is terminated, (2) at least one lp in D is waiting to
receive, (3) if lp_i is in D and is waiting to receive from lp_j, then lp_j is also in D, and
(4) there is no message in transit from lp_j to lp_i.

It follows then that none of the lp’s in D will carry out any further computation,
as they will remain waiting for each other.
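The four conditions can be read as a predicate over a global snapshot of the computation. The sketch below (Python) is only an illustration under assumed data structures, which are not part of the model itself: `waiting_for` maps each waiting lp to the set of lp’s it is waiting to receive from, `terminated` is the set of terminated lp’s, and `in_transit` is the set of (sender, receiver) pairs with at least one undelivered message. The lp names p, q, r and outside are hypothetical.

```python
def is_deadlocked(D, waiting_for, terminated, in_transit):
    """Check conditions (1)-(4) of the deadlock definition for the set D."""
    some_waiting = False
    for lp in D:
        if lp in waiting_for:
            some_waiting = True
            for sender in waiting_for[lp]:
                if sender not in D:              # violates condition (3)
                    return False
                if (sender, lp) in in_transit:   # violates condition (4)
                    return False
        elif lp not in terminated:               # violates condition (1)
            return False
    return some_waiting                          # condition (2)


# Three hypothetical lp's waiting on each other in a cycle.
cycle = {"p": {"r"}, "q": {"p"}, "r": {"q"}}
print(is_deadlocked({"p", "q", "r"}, cycle, set(), set()))          # True
print(is_deadlocked({"p", "q", "r"}, cycle, set(), {("q", "r")}))   # False
```

Such a check needs global knowledge of all lp’s and edges; no single lp in the model has it, which is why deadlock is a real concern for the basic scheme.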

3.2.  Basic  Scheme  for  Distributed  Simulation 

To simulate any given physical system, we construct a distributed logical
system as follows. We will associate one lp per pp; lp_i will simulate the actions of
pp_i. There is an edge from lp_i to lp_j if pp_i can send messages to pp_j. Messages among
lp’s will be transmitted along the edges connecting them.




An lp can simulate the actions of a pp up to time t if the lp knows all messages
that the corresponding pp receives up to time t. This is because, from the
realizability property, no future message (message received by the pp after time t)
can affect the pp’s behavior at t. We note further that an lp may be able to simulate
a pp even beyond time t by knowing its input messages up to time t, as shown in the
following example.

Example 3-1 (An lp May Predict the Future)

Consider a typical non-preemptive First Come First Serve (FCFS) server which
spends exactly 10 units of time servicing each job. Assume that a job arrives at time
t when this server is idle. From this information about input messages up to time t,
we can predict the behavior of the server up to time t + 10: it will produce no
output between times t and t + 10, but it will output a message at t + 10, sending
the job that has been serviced to its next destination.

From these observations, we can construct an algorithm for distributed
simulation. We note that the times at which pp’s send messages must be encoded
into the messages that the lp’s send. Thus if message m is sent by pp_i to pp_j at time
t, message (t,m) will be sent by lp_i to lp_j at some point during simulation and vice
versa.


We make a chronology requirement: if an lp sends a sequence of messages
<...(t_i,m_i),(t_{i+1},m_{i+1})...> to another lp, then t_i < t_{i+1} ... The implication of this
requirement is, if lp_j receives (t,m) from lp_i, then it knows all messages that pp_j
receives from pp_i up to and including time t, because any future message will have a
higher t-component.

Define the edge clock value of an edge to be the t-component of the last
message received along that edge; the edge clock value is 0 if no message has been
received along that edge. Clearly, every lp_i knows all messages received by the
corresponding pp_i up to time T_i = min_j {t_j}, where the t_j’s are the edge clock values of
all incoming edges to that lp and the minimum is taken over all these incoming edges.
lp_i can thus safely simulate pp_i up to T_i, i.e., it can deduce every message that pp_i
sends up to time T_i. lp_i may also be able to deduce pp_i’s message transmissions
beyond T_i. In any case, lp_i will send messages corresponding to all the messages it
can deduce for pp_i. The basic simulation algorithm followed by lp_i is sketched next.

Algorithm A

Basic distributed simulation algorithm to be followed by lp_i.

Initialize::  T_i := 0  {All messages received by pp_i up to T_i are now known to lp_i}

while simulation completion criterion is not met do

    {simulate pp_i up to T_i by doing the following}::

        for each outgoing edge, compute the sequence of messages
        <(t_1,m_1), (t_2,m_2) ... (t_r,m_r)>, where t_1 < t_2 ... < t_r and pp_i sends
        m_j at time t_j along this edge;

        send each message in sequence along the appropriate edge;

        {NOTE: all messages sent by pp_i up to T_i can be deduced
         by lp_i and sent; also some messages to be sent beyond
         T_i may be predicted by lp_i and sent. Only new messages,
         that have not been sent before, are sent. Also note that
         some or all of these message sequences may be empty.}

    {receive messages and update T_i until T_i changes value}::

        {T_i’ denotes the value of T_i on entry to the loop below}
        while T_i’ = T_i do

            wait to receive messages along all incoming edges;

            upon receipt of a message, update lp_i’s internal state and
            recompute T_i, the minimum over all incoming edge clock values

        endwhile;

endwhile

Note: Those lp’s which have no incoming edges will be called source lp’s. Each
source lp also follows this algorithm except that it simply sends messages until
the simulation completion criterion is met. A sink lp simply receives messages
and otherwise does not affect the simulation.

Example  3-2  (Distributed  Simulation  of  the  Assembly  Line) 

Let us review the assembly line example (Example 2-4). In the following, we
have one lp each for the source, the sink, work station A, work station B and work
station C.

We reproduce Table 2-3 here, which shows the job generation and processing
times.


                                     job
    work station             1     2     3     4

    Generation Times
        Source               5     7    30    32

    Service Times
        A                    4    10     1     5
        B                   12    15     2     7
        C                    2     3     1     4

Job Generation Times and Servicing Times


The following diagram shows the messages sent by each lp; an arrow from (t,m)
to (t’,m’) means that sending of (t,m) precedes sending of (t’,m’).

Source:  (5,Source,A,1)  →  (7,Source,A,2)  →  (30,Source,A,3)  →  (32,Source,A,4)
               ↓                  ↓                   ↓                   ↓
A:       (9,A,B,1)       →  (19,A,B,2)      →  (31,A,B,3)       →  (37,A,B,4)
               ↓                  ↓                   ↓                   ↓
B:       (21,B,C,1)      →  (36,B,C,2)      →  (38,B,C,3)       →  (45,B,C,4)
               ↓                  ↓                   ↓                   ↓
C:       (23,C,Sink,1)   →  (39,C,Sink,2)   →  (40,C,Sink,3)    →  (49,C,Sink,4)


Note in this example that the source can send its messages to A without waiting
for any input; A can send the i-th message to B only after receiving the i-th message
from the source, etc. Two messages on different lp’s between which there is no
sequence of arrows are independent and hence may be transmitted simultaneously in
the simulator. For instance, (32,Source,A,4), (31,A,B,3), (36,B,C,2), (23,C,Sink,1) can
possibly be transmitted simultaneously. The five lp’s form a pipeline through which
each job passes. If the speeds of the lp’s are approximately equal and the
transmission delays between lp’s are approximately equal then the pipeline should
work at full efficiency; one job is input and one job is output per cycle after an initial
delay of 3 cycles.

This  is  about  the  simplest  simulation  example  one  can  think  of.  We  study  a 
harder  example,  next. 


Example  3-3  (A  Primitive  Computer  System) 


Figure  3-1:  A  Primitive  Computer  System 

We have one lp each corresponding to the source, the CPU, Proc1, Proc2, M, B
and the sink. For this example, assume that jobs arrive at the CPU from the
source every 5 time units starting at time 3, that jobs spend 1 unit at the CPU, that
jobs alternately go to Proc1 and Proc2 from B, and that a job spends 5 units at
Proc1, 18 units at Proc2, and no time at B or M. We show the sequence of messages
and their dependencies below. (To simplify the diagram, we have not shown the
arrows between messages at a pp.)


Source:  (3,Source,CPU,1)   (8,Source,CPU,2)   (13,Source,CPU,3)   (18,Source,CPU,4)   (23,Source,CPU,5)
                ↓                  ↓                   ↓                   ↓                   ↓
CPU:     (4,CPU,B,1)        (9,CPU,B,2)        (14,CPU,B,3)        (19,CPU,B,4)        (24,CPU,B,5)
                ↓                  ↓                   ↓                   ↓                   ↓
B:       (4,B,Proc1,1)      (9,B,Proc2,2)      (14,B,Proc1,3)      (19,B,Proc2,4)      (24,B,Proc1,5)

Proc1:   (9,Proc1,M,1)                         (19,Proc1,M,3)                          (29,Proc1,M,5)

Proc2:                      (27,Proc2,M,2)                         (45,Proc2,M,4)

M:       (9,M,Sink,1)   (19,M,Sink,3)   (27,M,Sink,2)   (29,M,Sink,5)


Note the behavior of the lp corresponding to M. Assume that it first receives
(27,Proc2,M,2) from the lp corresponding to Proc2. This is entirely possible if,
for instance, the lp corresponding to Proc2 were considerably faster than the one
corresponding to Proc1. Then the lp for M can only infer that it won’t receive any
other message from the lp corresponding to Proc2 with time component smaller than
27. However, it cannot assert anything about messages from Proc1; it can thus
simulate pp M only up to time 0. Suppose next it receives (45,Proc2,M,4); it must
still wait. The next input is, say, (9,Proc1,M,1). Then the lp corresponding to M can
assert that it knows all inputs that M receives up to time 9 and hence predict all of
M’s outputs at least up to 9; therefore it can output (9,M,Sink,1), since jobs
spend no time at M. The rest of the outputs of M are easy to see. Finally, note that
M cannot output (45,M,Sink,4) at the very end, because it does not know if it will
receive a message with a t-component lower than 45 from the lp corresponding to
Proc1. An extra message must be sent from Proc1 to M, with t-component exceeding
45, to "flush out" this message. We will discuss this issue later.
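The behavior of the lp for M can be traced with a small sketch of Algorithm A specialised to a merge point with zero service time. This is an illustration in Python under stated assumptions, not the monograph's code: the class name MergeLP, its fields, and the use of a heap to buffer undelivered jobs are all hypothetical choices. Each call to receive updates one edge clock, recomputes T as the minimum over the incoming edge clocks, and releases every buffered message whose t-component does not exceed T.

```python
import heapq


class MergeLP:
    """lp for a merge point that forwards each job to the sink with no delay."""

    def __init__(self, incoming_edges):
        self.edge_clock = {edge: 0 for edge in incoming_edges}
        self.buffer = []          # heap of (t, job) received but not yet forwarded
        self.T = 0                # minimum over all incoming edge clock values

    def receive(self, edge, t, job):
        # Chronology requirement: t-components never decrease along an edge.
        assert t >= self.edge_clock[edge]
        self.edge_clock[edge] = t
        heapq.heappush(self.buffer, (t, job))
        self.T = min(self.edge_clock.values())
        sent = []
        # Everything with t <= T is now safe to simulate and send.
        while self.buffer and self.buffer[0][0] <= self.T:
            bt, bjob = heapq.heappop(self.buffer)
            sent.append((bt, "M", "Sink", bjob))
        return sent


m = MergeLP(["Proc1", "Proc2"])
arrivals = [("Proc2", 27, 2), ("Proc2", 45, 4),
            ("Proc1", 9, 1), ("Proc1", 19, 3), ("Proc1", 29, 5)]
outputs = [m.receive(edge, t, job) for edge, t, job in arrivals]
for step in outputs:
    print(step)
print("still buffered:", m.buffer)
```

With the arrival order assumed in the text, nothing is sent after the two messages from Proc2, and the message for job 4 remains buffered at the end because the edge clock of (Proc1,M) never passes 29.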

3.3.  Partial  Correctness  of  the  Basic  Distributed  Simulation  Scheme 

Correctness of a distributed simulation algorithm consists of two parts: (1) if a
message m is transmitted in the physical system at time t, then (t,m) is transmitted
in the simulator and, (2) if (t,m) is transmitted in the simulator, then message m was
transmitted at time t in the physical system. These statements are not quite true in
the basic distributed simulation scheme just presented. As we observed in the last
example, job 4 is sent at time 45 from M to the sink in the physical system, although
the corresponding message is never sent in the simulator. Therefore, we can prove
only one part of the correctness condition stated above: whatever is transmitted in
the simulator actually happens in the physical system. We will postpone discussion of
the converse statement (if message m is transmitted at time t in the physical system,
then (t,m) is transmitted in the simulator) to the next chapter.

Define a simulation to be correct at some point if (1) whenever message m is sent
at time t along edge e in the physical system and t is less than or equal to the edge
clock value of edge e at this point in simulation, then (t,m) has been sent (along edge e)
in the simulation, and (2) if (t,m) has been sent in the simulation, then message m is
sent at time t in the physical system.

We note that in a simulation which is correct at some point, every lp must have
received a correct input sequence along every incoming edge, i.e., every message on
this edge that has been transmitted in the physical system up to this edge clock
value has been received along this edge in the simulation and vice versa. We will
assume that every lp correctly simulates the corresponding pp, i.e., if an lp receives
correct input sequences along all incoming edges, then it sends correct output
sequences along all outgoing edges. Clearly a simulation is correct if and only if every
lp has sent correct output sequences along every outgoing edge. The following
theorem follows, by applying induction on the number of messages transmitted in the
simulation.

Theorem 1: Simulation is correct at every point.

Proof: Simulation is obviously correct, from the definition, when no
message has been transmitted in the simulation. Assume that a simulation
is correct up to some point. The next message in the simulation is sent by
some lp_i. Since simulation is correct prior to this message transmission, lp_i
has received correct input sequences so far. From our assumption that lp_i
correctly simulates pp_i, the output sequences of lp_i, including the last
message sent, are correct. Every other lp has sent correct sequences so far,
from the inductive hypothesis. Hence the simulation is correct following the
last message transmission.

In a similar manner, we can derive the following result.

Theorem 2: All messages sent by one lp to another are
chronological in their time components.

3.4.  Features  of  the  Basic  Distributed  Simulation  Scheme 
The  Problem  of  Deadlock 

Theorem 1 tells us only that whatever is transmitted in the simulator
corresponds to a message in the physical system. As we have noted earlier, not all
messages in the physical system do get transmitted in the simulator using the basic
simulation scheme. In fact, the next example shows a system in which no message
gets transmitted to a subsystem in the simulator.

Example  3-4  (A  Deadlocked  Subsystem  in  a  Distributed 
Simulation) 


Figure  3-2:  A  Distributed  Simulation  That  Does  Not  Progress 

Consider a physical system in which the source sends messages to a branch
point B, B routes the messages to Proc1 or Proc2 from where, after some finite time,
each message is sent to a merge point M, after which it enters a network N (see
Figure 3-2). Consider the case where B sends every message to Proc1. Then in the
simulation, the lp corresponding to M will never receive a message from Proc2. Hence
the edge clock value for the edge (Proc2,M) will remain at 0 and the lp for M will
never send a message. The subsystem N will thus never receive a message. If the
simulation continues, certain parts of the system, viz. the source and Proc1, will keep
advancing their clocks; however neither M nor any lp within N will advance its clock.
We can claim that the clock for no lp in N can progress beyond t=0.
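
The blocking at M can be illustrated with a minimal sketch. It assumes, as a simplification, that an lp may simulate no further than the smallest edge clock among its input edges; the function name and the t-components fed in from Proc1 are hypothetical and are not taken from the report.

```python
def safe_horizon(edge_clocks):
    """Latest time up to which an lp may simulate: the minimum clock over its input edges."""
    return min(edge_clocks.values())


# Input edges of the merge point M in Example 3-4; both edge clocks start at 0.
edges_into_m = {"Proc1": 0, "Proc2": 0}

# Hypothetical t-components of successive messages from Proc1 (assumed values).
for t in (5, 12, 20):
    edges_into_m["Proc1"] = t
    # Proc2 never sends, so the horizon of M stays at 0 however far Proc1 advances.
    print(t, safe_horizon(edges_into_m))
```

However many messages arrive from Proc1, the horizon of M is pinned by the silent edge from Proc2, which is why nothing ever reaches N.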

We show another example in which the deadlock arises due to a circular
pattern of waiting among the lp’s.

Example  3-5  (Cyclic  Waiting  in  a  Distributed  Simulation) 


Figure  3-3:  A  Distributed  Simulation  That  Deadlocks 

Consider a network of 3 processes and a source, shown schematically above.
The number on each edge is the edge clock value, i.e., the last message sent from x to
y and received by y had a t-component of 20, and so on. Suppose that none of x,y,z
will now send a message unless it receives a message, i.e., they can predict no
future messages.

A global observer can see that z will not send a message unless x first sends a
message to y. Hence x need not wait for z; it can process the next message from the
source. However none of the lp’s corresponding to x,y,z have this global knowledge;
they only have local knowledge of the behavior of each individual pp. Therefore x
cannot proceed unless it receives from z, z cannot proceed unless it receives from y
and y cannot proceed unless it receives from x, leading to a deadlock.

Simulation  Snapshot 

In a sequential simulation, it is possible to assert that the simulator has
completed simulation up to the time given by the clock: every pp must have been
simulated up to this point in time. We cannot make a similar statement for
distributed simulation, because each lp may have simulated the corresponding pp to a
different point in time. For instance, in the example of the primitive computer
system (Example 3-3), we can assert at the end that the lp’s have simulated the
corresponding pp’s as follows: (Source : 23), (CPU : 23), (D : 21), (Proc1 : 21),
(Proc2 : 19), (N1 : 29).

We define T_i, the clock value of lp_i, to be the point in time up to which pp_i has
been simulated by lp_i. Thus lp_i has received messages along all incoming edges up to
at least T_i and has sent messages at least up to T_i along every outgoing edge (i.e., all
future messages it will send will have t-components exceeding T_i). T_i is the
maximum value satisfying the above conditions. Define T, the clock value of the
simulator, to be the minimum of all lp clock values. We can assert that at any point
in simulation, the physical system has been simulated up to the simulator’s clock
value, even though some individual lp’s may have simulated the corresponding pp’s
far beyond T.
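
As a small worked instance of this definition, the sketch below computes the simulator clock from the lp clock values quoted above for Example 3-3. The function and variable names are illustrative only.

```python
def simulator_clock(lp_clocks):
    """Clock value of the simulator: the minimum of all lp clock values."""
    return min(lp_clocks.values())


# lp clock values reported at the end of Example 3-3.
lp_clocks = {"Source": 23, "CPU": 23, "D": 21, "Proc1": 21, "Proc2": 19, "N1": 29}

T = simulator_clock(lp_clocks)

# How far each lp has run ahead of the simulator clock.
ahead = {name: clock - T for name, clock in lp_clocks.items() if clock > T}

print(T)      # 19
print(ahead)  # N1 is 10 time units ahead of the simulator clock
```

The physical system as a whole can be declared simulated only up to 19, even though N1 has already reached 29.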

Encapsulation  of  Physical  Processes  by  Logical  Processes 

The radical departure in the proposed scheme from sequential simulation,
however, is the lack of any global control. (We will show deadlock resolution without
global control in the next chapter.) Since a pp is simulated entirely by one lp,
various different simulations of a pp can be attempted by substituting different lp’s
for it. Furthermore, the correctness of simulation can be checked one lp at a time:
the proof of correctness is naturally partitioned among lp’s, i.e., we show that each
lp correctly simulates the behavior of the corresponding pp. We have shown that if
each lp behaves correctly, the ensemble as a whole behaves correctly. This
observation will lead to major simplifications in designing complex simulations. In
fact, distributed simulations can be implemented using existing sequential
simulations; instead of reporting to a central event-list manager, an lp sends messages
and otherwise the core of the simulation remains unchanged.

Chapter 4

Distributed Simulation: Deadlock Resolution


We  have  seen  in  the  last  chapter  that  the  basic  distributed  simulation  scheme 
may  lead  to  deadlock  even  in  acyclic  networks.  In  this  chapter,  we  present  several 
different  approaches  to  resolution  of  deadlock.  We  comment  on  some  of  the  most 
viable  approaches  for  deadlock  resolution. 

4.1.  Overview  of  Deadlock  Resolution 

In all the examples we have seen so far, the simulator clock value (recall that
the simulator clock value is the minimum of all lp clock values) remains at some final
value T forever. If T is smaller than the point up to which we need to run the
simulation, we have to apply some other scheme to advance the simulation.
Simulations stop (other than by conscious choice) when some lp has more than one
input edge, an external observer can determine that it will receive no more input
messages along some particular edge, and the lp cannot proceed further in
its simulation unless it receives this information. For instance, in the example of the
primitive computer system (Example 3-3), the lp corresponding to M cannot proceed
any further unless it is told that Proc1 will never send it a message. Another
example is Example 3-5, where process x must be told that it will never receive any
input along zx until x first sends a message. The first scheme we describe, using null
messages [7], is effectively an implementation of this idea. We will also discuss some
other schemes which avoid deadlock by using different kinds of overhead messages.

4.2.  Deadlock  Resolution  Using  NULL  Messages 

We postulate a new kind of message to be used in the simulator. (t,null) sent
by lp_i to lp_j means that pp_i sends no message to pp_j between the current edge clock
value of the edge from lp_i to lp_j and t; therefore any future message from lp_i to lp_j
will have a t-component exceeding t. Clearly null messages have no counterpart in
the physical system. A null message is used to announce absence of messages.
Absence of messages in a physical system at time t is recognized by no message being
transmitted at that time. Unfortunately, the basic scheme of the last chapter cannot
guarantee absence of messages to an lp without sending it an actual (non-null)
message having a higher t-component value.

We now propose modifications to the basic algorithm of chapter 3 to
incorporate null messages. Let us first review the basic distributed simulation scheme
of the last chapter. T_i denotes the clock value of lp_i. Whenever an lp_i receives a
message, it properly updates T_i, and if T_i changes in value, then lp_i advances the
simulation of pp_i up to T_i. At this point, lp_i predicts for each outgoing edge a
sequence of messages that pp_i would have sent. Thus lp_i typically generates
<(t_j^1, m_j^1), (t_j^2, m_j^2), ...> for transmission to lp_j, for every j to which it has
outgoing edges. Some of these sequences may be empty, in which case no message is
sent to the corresponding lp. Suppose that lp_i can further predict that after
transmission of this message sequence pp_i will not send any more messages to pp_j
until time t_j. Then, in the new proposed scheme, lp_i sends (t_j,null) to lp_j after
sending the genuine message sequence. Since lp_i knows the state of the corresponding
pp up to time T_i, it can predict all messages (that are to be sent) and absence of
messages, at least up to T_i. Therefore, every outgoing edge will have a last message
on it with time component equal to or greater than T_i. Note that only the last
message sent along an edge may be a null message, in any iteration.

Reception of a null message is treated in the same manner as the reception of
any other message: it causes the lp to update its internal state including the clock
value and (possibly) send messages.
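
The following is a minimal sketch of an lp that follows the null-message rule, under assumptions that are not part of the report: the pp is taken to be a pure delay element that forwards every job after a fixed `delay`, there is a single outgoing edge, and timestamps on each input edge are strictly increasing. The class name `DelayLP` and the constant `NULL` are hypothetical.

```python
import heapq

NULL = "null"  # payload used for null messages in this sketch (assumed encoding)


class DelayLP:
    """An lp for a pp that forwards each job after a fixed delay (assumed behavior)."""

    def __init__(self, inputs, delay):
        self.edge_clock = {edge: 0 for edge in inputs}  # one clock per input edge
        self.delay = delay
        self.clock = 0       # T_i: time up to which the pp has been simulated
        self.pending = []    # heap of (t, seq, job) not yet simulated
        self.seq = 0         # tie-breaker so heap entries always compare

    def receive(self, edge, t, payload):
        """Handle (t, payload) on `edge`; return the messages to send downstream."""
        self.edge_clock[edge] = t
        if payload != NULL:
            heapq.heappush(self.pending, (t, self.seq, payload))
            self.seq += 1
        horizon = min(self.edge_clock.values())
        out = []
        if horizon > self.clock:
            self.clock = horizon
            # Genuine outputs for every job that arrived up to the new clock value.
            while self.pending and self.pending[0][0] <= horizon:
                t_in, _, job = heapq.heappop(self.pending)
                out.append((t_in + self.delay, job))
            # Nothing that arrives after `horizon` can leave before horizon + delay,
            # so announce that absence with a null message unless a genuine
            # message already carries that t-component.
            promise = horizon + self.delay
            if not out or out[-1][0] < promise:
                out.append((promise, NULL))
        return out


lp = DelayLP(["Proc1", "Proc2"], delay=2)
print(lp.receive("Proc1", 5, "job"))  # [] : still waiting on Proc2
print(lp.receive("Proc2", 7, NULL))   # [(7, 'job')]
print(lp.receive("Proc1", 9, NULL))   # [(9, 'null')]
```

In this sketch a null message on the Proc2 edge is enough to release the job that arrived from Proc1, and at most one null message, the last, is emitted per call.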

Suppose it is required to simulate the physical system up to some time z. Then
every source must send messages until the t-component of the last message equals z;
if no non-null message exists with this property then finally (z,null) should be sent.

Example  4-1 

Consider the physical system shown schematically in Figure 4-1, below.


Figure  4-1:  A  Physical  System  with  Loop 

We will study the progress of one possible simulation run of this physical system.
The source sends out jobs which are processed at X for 2 time units. Jobs are routed
alternately to Y and Z from B_1. Y processes a job for 1 unit and Z for 1 units.
Every job loops through the system twice, i.e., the first time a job arrives at B_2, it is
sent back to M_1 and on the second arrival at B_2, it is sent to the sink.

Table 4-1 shows a succession of message transfers, where each horizontal row is
a time slice and each column corresponds to a single activity of one of the processes.
Concurrency is apparent because there are several activities happening at one time
slice, i.e., in one row.

4.3. Correctness of the Simulation Algorithm

The partial correctness results of the last chapter still apply. The only
difference now is the presence of null messages. We define the simulation to be
correct at some point if it is correct according to the definition of chapter 3 after
ignoring null messages.

Theorem  1:  Simulation  is  correct  at  every  point 
Proof:  The  proof  is  almost  identical  to  the  previous  proof  and  hence 
omitted  here. 

The  next  theorem  shows  the  power  of  adding  null  messages:  we  show  that  we  have  a 
deadlock-free  system  which  can  simulate  a  physical  system  up  to  time  z. 

Theorem 2: Assume that every source process sends messages until
the t-component of a message equals z. Then every lp will simulate the
corresponding pp, at least up to z.

Proof: Consider the point where the simulation terminates, i.e.,
where all messages that have been sent have been received and no lp has
any outstanding message to send. The following observation is critical: for
every lp (except a source lp) there exists an incoming edge to that lp whose
edge clock value is less than or equal to the edge clock value of every
outgoing edge from that lp. This observation follows because: (1) an lp
that has received messages at least up to t along every input edge must
have sent messages (t’,m’), t’ ≥ t, along every outgoing edge, and (2) every
message that has been sent has been received when simulation terminates.

Note that (1) could not be asserted in the basic scheme because an lp need
not send out messages with higher t-component values than the input
messages.

We now claim that the edge clock value for every edge is at least z. If
not, consider an edge e_1, outgoing from some lp, whose edge clock value is
t_1, with t_1 < z. According to the above observation, there exists an edge
e_2, which is an incoming edge to this lp, such that e_2's edge clock value is
t_2, where t_2 ≤ t_1. Continuing in this manner, we can construct a sequence
of edges e_1, e_2, ..., e_i, ... such that for all i, e_{i+1} is a predecessor
edge of e_i and t_{i+1} ≤ t_i, and we have t_1 < z. Since the physical network
is finite, we will eventually either (i) get to a source lp, or (ii) have a
cycle of edges. In the first case, since every source lp sends messages until
the t-component of the last message sent is z, we cannot have the edge clock
value of any outgoing edge of a source lp smaller than z. In the second case,
all edge clock values in the cycle are equal to some t_i, with t_i < z. From
the predictability property (chapter 2), for this cycle and this t_i, there
exists a pp, say pp_j, whose outputs can be determined beyond t_i given its
inputs up to t_i. Hence lp_j has some messages to send, which contradicts our
assumption that the simulation has terminated. Therefore the edge clock
value of every edge is at least z and hence the simulation clock value is at
least z.

We have implicitly used the fact that for any finite z, only a finite
number of messages may be transmitted in the logical system. This is
derived from the predictability property, in which the parameter ε, ε > 0,
is a fixed quantity. A more rigorous proof of this boundedness
property may be found in [7].

Table 4-1: Message Transmissions in the Simulation of Example 4-1

Discussion 

It is interesting to note that the simulator never deadlocks: if the physical
system deadlocks, the simulator continues computation by transmitting null messages
with increasing t-values. This correctly simulates the corresponding physical
situation, in that while time progresses, no messages are transmitted in the physical
system. Ultimately, the simulator will terminate with every clock value at least at
z. The simplicity of this scheme is one of its most attractive points. It requires small
coding changes in existing distributed simulations to send out null messages.
Furthermore, the requirement of unbounded buffers between two lp’s is not really
necessary. The same results hold if there are only a finite number of buffer
spaces between every lp_i and lp_j and lp_i has to wait to send if all buffer spaces are
currently full. The proof that there is no deadlock in this situation is essentially
contained in [7].

Empirical studies [28] show that this scheme is quite efficient for acyclic
networks. Several factors seem to affect the efficiency:

(1)  Degree  of  Branching  in  the  Network 

Consider a network with one source and one sink. The number of distinct
paths between the source and the sink is a (rough) measure of the amount of
branching in the network. Null messages tend to get created at branches and they
may proliferate at all successive branches (if not subsumed). So one would expect
that the fewer the number of branches, the better the performance. Empirical
studies [28] seem to confirm this. Theoretically optimum efficiency is achieved for a
tandem network (the assembly line example of chapter 2, Example 2-4), and excellent
results are obtained for low-branching type networks. In general, acyclic
networks exhibit reasonably good performance levels. Note that the metric of
interest in performance calculations is the turnaround time, i.e., the amount of time
it takes to complete the simulation, rather than processor utilization, i.e., the fraction
of time the processors are utilized. In fact, one would expect the processors to be
lightly utilized. The other parameter of interest, line bandwidth, has not received
adequate attention.

Experiments were carried out by Peacock, Wong and Manning [24,25] on
networks of various topologies. Their conclusions: “for some topologies of queueing
networks models, this approach results in a speedup in the total time to complete a
given simulation. However, for other topologies, especially those with loops, the
speed-up may not be significant.” They also investigated several different ways of
partitioning the physical network so that more than one pp may be implemented on
one lp.

(2)  Time-Out  Mechanisms  to  Prevent  Null  Message 
Transmission 

A slight modification may save a considerable number of message transmissions.
A null message (t,null) has no effect if it is followed by another message (t’,m), t’>t.
Therefore it may be efficient to delay transmissions of null messages in the hope that
future messages received by an lp would make it unnecessary to transmit them at all.
Clearly the amount of time, r, that an lp waits before transmitting a null message is
of importance. If r = 0, we have the algorithm as stated in this chapter. If r = ∞,
null messages are never transmitted and then we have the basic algorithm of chapter
3, which may lead to deadlock. Other values of r are of potential interest, but no
empirical studies have been performed to substantiate our claims.

(3)  Amount  of  Buffering  on  Edges 

The number of buffer spaces on edges seems to have substantial effects on
performance [26,28]. When the number of buffer spaces was reduced to 0, senders
had  to  wait  until  the  receivers  were  ready  to  receive,  and  a  considerable  amount  of 
time  seemed  to  be  spent  in  waiting.  The  number  of  buffer  spaces  was  then  increased 
and  the  following  rule  was  used  to  annihilate  null  messages:  any  message  put  in  the 
buffer  after  a  null  message  (and  therefore  with  a  higher  t-component)  annihilates  any 
null  message  ahead  of  it  still  in  the  buffer.  The  annihilation  rule  is  somewhat  similar 
to  the  time-out  mechanism.  It  was  found  that  in  the  simulation  of  a  certain  class  of 
queueing  networks  the  performance  improved  rapidly  until  the  number  of  buffer 
spaces  on  an  edge  approached  10,  increased  less  rapidly  until  about  20,  and  remained 
essentially  unchanged  thereafter.  These  numbers  however  cannot  be  applied  directly 
for  other  problems;  we  expect  these  numbers  to  depend  on  the  type  of  problem  and 
the  speeds  of  processors  and  lines. 
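
The annihilation rule can be sketched as a bounded edge buffer. This is an illustration under assumptions, not the implementation used in the cited experiments: the class name `EdgeBuffer`, the `NULL` payload encoding and the choice to refuse a message when the buffer is full are all hypothetical, and the sketch needs at least one buffer space.

```python
from collections import deque

NULL = "null"  # payload used for null messages in this sketch (assumed encoding)


class EdgeBuffer:
    """Bounded buffer on one edge that applies the null-message annihilation rule."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def put(self, t, payload):
        """Queue (t, payload); return False if the sender has to wait for space."""
        # Any message placed behind a null message has a higher t-component,
        # so the null messages still waiting in the buffer carry no information.
        kept = [m for m in self.items if m[1] != NULL]
        if len(kept) >= self.capacity:
            return False  # buffer full of genuine messages: leave it untouched
        self.items = deque(kept)
        self.items.append((t, payload))
        return True

    def get(self):
        """Remove and return the oldest message, or None if the buffer is empty."""
        return self.items.popleft() if self.items else None


buf = EdgeBuffer(capacity=2)
buf.put(3, NULL)
buf.put(5, NULL)    # annihilates (3, null)
buf.put(6, "job")   # annihilates (5, null)
print(buf.get())    # (6, 'job')
```

With more buffer space, more null messages are likely to be overtaken before the receiver reads them, which is consistent with the reported improvement as the number of buffer spaces grows.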

We discuss various issues related to empirical investigations in the next chapter.

4.4. Demand Driven Null Message Transmission

Another variation with null message transmissions is not to transmit a null
message until asked to do so. In this scheme, an lp_i may receive an inquiry from
another lp_j where there is an edge from lp_i to lp_j. lp_j sends an inquiry to find out
when lp_i will send it the next message. If lp_i has a genuine message to send or a null
message, which will advance the edge clock value, it will do so in response to this
inquiry. If it cannot send any such message, lp_i must itself be waiting for one or
more of its incoming edge clock values to advance and hence it propagates this
inquiry backward along those edges. The inquiry may be propagated along a
sequence of edges. lp_i must remember to respond to the inquiry as soon as it can, i.e.,
as soon as its own clock value advances. An lp may receive several inquiries before
responding to any of them. In this case, it will propagate at most one, wait for the
reply and then reply to the others.

A particularly interesting part of this scheme is the detection of deadlock. In a
situation as in Example 3-5, an inquiry initiated by x is propagated backward and
arrives at x. The lp for x can then detect deadlock. Resolution of deadlock requires
finding the lp which has the smallest edge clock value t along some input edge,
ignoring the set of deadlocked edges. This lp can then assert that it will receive no
more input, up to t, along the deadlocked edge. Therefore it continues simulation
assuming that it has received (t,null) along the deadlocked edge. In this example, x
is the only process having edges outside the deadlocked set. Therefore x simply stops
waiting to receive from z and advances its clock based on input from the source alone.
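
Backward propagation of an inquiry can be sketched as a graph search. The version below is a centralized simplification and not the distributed protocol itself: it ignores the rule of one outstanding inquiry per lp, and it reports deadlock when no lp reached by the inquiry is able to answer, which in Example 3-5 coincides with the inquiry coming back to x. The function name and the `waiting_on` map are hypothetical.

```python
def inquiry_finds_deadlock(initiator, waiting_on):
    """Follow an inquiry backward; True if no lp it reaches is able to answer.

    waiting_on maps each lp to the predecessor lps whose edge clocks it is
    waiting for; an lp with no entry (or an empty list) can answer at once.
    """
    seen = set()
    stack = [initiator]
    while stack:
        lp = stack.pop()
        if lp in seen:
            continue  # the inquiry has come around a cycle
        seen.add(lp)
        blockers = waiting_on.get(lp, [])
        if not blockers:
            return False  # this lp can send a genuine or null message
        stack.extend(blockers)
    return True


# Example 3-5: x waits for z, z waits for y, y waits for x.
cyclic = {"x": ["z"], "z": ["y"], "y": ["x"]}
print(inquiry_finds_deadlock("x", cyclic))   # True

# Same chain, but y is able to answer, so the inquiry is eventually satisfied.
acyclic = {"x": ["z"], "z": ["y"], "y": []}
print(inquiry_finds_deadlock("x", acyclic))  # False
```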

The claim that the inquiry propagation mechanism does indeed detect deadlock
and that at most one inquiry by an lp is outstanding at any time is not entirely
trivial to prove; see [10,11] for discussions of a similar problem and its proof. A
reasonable heuristic for an lp to initiate an inquiry may be based on time-outs.


4.5.  Rollback  and  Recovery 

A scheme suggested by Jefferson and Sowizral [18] allows an lp to proceed with
its computation, with the belief that it will receive no further input along an
incoming edge if it has not received any during a certain time period. Suppose that
lp_i changes its state from s to s’ and sends out messages M_1, M_2, ... as a result of
this belief. Suppose that in the future, a message is received along an edge which
contradicts this assumption. Then the state of the lp must be rolled back to s; in
addition, states of other lp’s which may have received M_1, M_2, ... must also be rolled
back. It is proposed to use a stack in which some of the recent states of an lp may be
retained; the bottom of the stack is a guaranteed correct state at some point, and
hence there will be no further rollback beyond that state. “Antimessages” M_1’, M_2’, ...
are sent to cancel the effects of the corresponding messages and roll back the states of
the lp’s which previously received the messages M_1, M_2, and so on.

If processor speeds, speed of simulating a pp by an lp, line delays, etc. can be
accurately predicted, this method may turn out to be quite practical. In such cases,
one would expect to have few rollbacks. However it seems that, in general, large
amounts of memory would be required to stack the states and a large number of
antimessages will have to be transmitted whenever a rollback is required.
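
The state stack and the antimessages can be sketched as follows. This is only an illustration of the idea described above, not the mechanism of [18]: the class `OptimisticLP`, its method names and the representation of an antimessage as an `("anti", message)` pair are assumptions.

```python
class OptimisticLP:
    """An lp that proceeds on a belief and can undo the step later (sketch)."""

    def __init__(self, state):
        self.state = state
        self.saved = []  # stack of (state before the step, messages sent in the step)

    def speculate(self, new_state, messages):
        """Move from the current state to new_state, sending `messages` on a belief."""
        self.saved.append((self.state, list(messages)))
        self.state = new_state
        return list(messages)

    def rollback(self):
        """Undo the most recent speculative step; return the antimessages to send."""
        old_state, sent = self.saved.pop()
        self.state = old_state
        return [("anti", m) for m in sent]


lp = OptimisticLP("s")
lp.speculate("s'", ["M1", "M2"])  # proceeds, believing no further input will come
print(lp.rollback())              # [('anti', 'M1'), ('anti', 'M2')]
print(lp.state)                   # s
```

Every speculative step costs one saved state and one antimessage per message sent, which is the memory and traffic overhead noted above.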

4.6.  Circulating  Marker  for  Deadlock  Detection  and  Recovery 

A suggestion has been made in [9] to let the basic simulation scheme deadlock,
detect deadlock and recover from it. Deadlock is infrequent, as has been suggested
by Quinlivan [26] from a number of empirical studies on queueing networks.
Therefore one would expect this to be a viable alternative if deadlock detection can
be implemented efficiently. Dev Kumar [20] has used a recent deadlock detection
scheme [22] to implement such an algorithm. We now discuss his method and several
of its variations.

Consider a marker that continuously circulates in a network. It follows a cycle
of edges such that it traverses every edge of the network sometime during a cycle;
such a cycle exists if the network is strongly connected, and new edges may be added to
the network to make it strongly connected. The marker is merely a special type of
message. It initially starts at some lp. If an lp receives the marker, its obligation is
to send the marker (along its designated route) within a finite time of being idle (i.e.,
not having anything more to send). We let the marker carry some information for
deadlock detection, as described below.

Each lp will have a one-bit flag to show whether the lp has received or sent a
message since the last visit of the marker. We say that an lp is white if it has neither
received nor sent a message since the last visit of the marker to that lp; the lp is
black otherwise. Initially all lp’s are black. The marker declares deadlock when it
finds that the last N lp’s that it has visited were all white (when the marker arrived
at the lp), where N is the number of lp’s in the network. This result holds if
messages between two lp’s, including the marker, are received in the order sent; see
[22] for a precise description and proof of this result.

We can use this scheme to detect and recover from deadlock. The marker, in
addition to keeping the number of white lp’s it has seen since it last saw a black lp,
carries the minimum of “next-event-times” for the white lp’s it visits: each white lp
can report the time of the next event, assuming it receives no further messages, to the
marker and the marker merely keeps track of the smallest of these, and the
corresponding lp. When the marker detects deadlock, it knows the next event time
and the lp at which this next event occurs. Therefore, it can restart that lp.
Alternatively, a central process may broadcast (send messages to all lp’s) to advance
their clocks to the next event time in the system.
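
The information carried by the marker can be sketched as a small record that is updated at each visit. The sketch assumes a network in which the marker visits each lp once per cycle and in which each lp reports its own colour and next-event time on arrival; the class `Marker`, its fields and the visit sequence in the usage lines are hypothetical.

```python
class Marker:
    """State carried by the circulating marker (sketch)."""

    def __init__(self, n):
        self.n = n                  # N, the number of lp's in the network
        self.white_run = 0          # white lp's seen since the last black lp
        self.min_next_event = None  # (time, lp) with the smallest next-event time

    def visit(self, lp_name, is_white, next_event_time):
        """Update the marker on arrival at an lp; return True if deadlock is declared."""
        if is_white:
            self.white_run += 1
            best = self.min_next_event
            if best is None or next_event_time < best[0]:
                self.min_next_event = (next_event_time, lp_name)
        else:
            # A black lp sent or received since the last visit: start counting again.
            self.white_run = 0
            self.min_next_event = None
        return self.white_run >= self.n


marker = Marker(n=3)
print(marker.visit("x", False, 25))  # False: x was active since the last visit
print(marker.visit("y", True, 30))   # False
print(marker.visit("z", True, 22))   # False
print(marker.visit("x", True, 25))   # True: the last 3 lp's visited were all white
print(marker.min_next_event)         # (22, 'z'): restart z at time 22
```

Resetting each lp's flag to white when the marker leaves is the responsibility of the lp and is omitted here.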


The  overhead  messages  in  this  case,  are  for  marker  transmissions.  If  deadlocks 
are  infrequent,  the  marker  may  be  made  to  move  slowly  (and  therefore  the  deadlock 
may  be  detected  quite  some  time  after  its  occurrence)  and  hence  the  proportion  of 
overhead  messages  to  genuine  messages  will  be  low. 

4.7.  Circulating  Marker  for  Deadlock  Avoidance 

The marker scheme of the last section can also be used for deadlock avoidance.
The idea is to let the marker carry messages. If lp_i is sending the marker to lp_j, it
may send a message (t,null), advancing that edge’s clock value as much as possible.
If lp_i cannot advance the clock value of the edge to lp_j, it still must send the marker,
without a message, in finite time. The marker carries no further information. Using
essentially the same arguments as in theorem 2 of this chapter, the system can be
shown to be deadlock-free.

Overhead messages are for marker transmission; however, unlike null messages
there is no proliferation of such messages. Another way to view this scheme is to
consider the marker as a circulating packet which carries only null messages (or is
empty) and delivers the messages to their destinations. The number of null message
transfers is bounded by the marker’s rate of traversal. By suitably adjusting the
speed of the marker, i.e., the length of time for which an lp holds the marker before
sending it, we can expect to reduce the number of overhead messages and still avoid
long delays by the lp’s.

Dev  Kumar  is  currently  investigating  the  performance  of  these  schemes  and 
several  variations  of  these,  including  the  use  of  multiple  markers. 


Chapter 5

Summary and Conclusion

In this chapter we summarize the discussions about distributed simulation, its
status,  problems  and  future  research  directions.  We  hope  to  have  demonstrated  that 
distributed  simulation  may  be  applied  in  every  situation  where  sequential  discrete 
event  simulation  can  be  applied.  Our  examples  have  been  predominantly  from  the 
area  of  computer  systems,  since  a  queueing  network  description  of  a  computer  is  a 
physical  system  in  our  method.  However,  our  physical  systems  encompass  a  large 
variety  of  real  world  applications;  the  only  difference  from  sequential  simulation 
modelling  is  to  think  in  terms  of  pp’s  and  messages  rather  than  entities  and  events. 

We have presented the methods of distributed simulation, but we have not
shown how these may be implemented on existing or future machines. We first note
that simulation of a pp by an lp can be realized in any simulation language; some
particularly suitable ones are SIMULA [13], CSP [16,19], MAY [2], ADA [1],
DEMOS [5], SAMOA [21]. All these languages provide enough abstraction
mechanisms to describe the behaviors of elementary components and message
communications among them. Hence, we contend that distributed simulation
requires nothing more than a language for creating sequential processes and
specifying their communications.

Implementation of distributed simulation therefore reduces to implementation
of a message-communicating set of processes on some architecture. The logical
system should then be partitioned among various processors in such a manner that
the message traffic among various parts is as low as possible. Message
communication may be accomplished either through a common memory (messages
are deposited in a common memory by the sender and removed by the receiver) or by
other interaction mechanisms among processors. The important criterion is how
loosely coupled the processors are. If two processors are tightly coupled, i.e., the
logical processes on these processors exchange a large number of messages, then the
processors must also exchange at least that many messages and therefore the message
traffic will be heavy. If processors are loosely coupled, they can operate
autonomously, i.e., without communicating with other processors, for longer periods of
time. It is also easier to avoid deadlock among a set of logical processes if they are
simulated on one processor.
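
The partitioning criterion above can be illustrated with a small sketch. It assumes we already know (or have estimated) how many messages each pair of logical processes exchanges; the function name, the input dictionaries, and the example counts are all hypothetical and are not part of the methods described in the text.

```python
from collections import defaultdict


def cross_processor_traffic(message_counts, assignment):
    """Count the messages that must travel between processors.

    message_counts: {(sender_lp, receiver_lp): number_of_messages}  (assumed input)
    assignment:     {lp: processor_id}                              (assumed input)
    Returns {(processor_a, processor_b): messages}, with a < b.
    """
    traffic = defaultdict(int)
    for (src, dst), count in message_counts.items():
        p, q = assignment[src], assignment[dst]
        if p != q:  # messages between lp's on one processor cost no line traffic
            traffic[tuple(sorted((p, q)))] += count
    return dict(traffic)


# Hypothetical message counts for four lp's.
counts = {("a", "b"): 100, ("b", "c"): 5, ("c", "d"): 80}

# Keeping heavily communicating lp's together gives loosely coupled processors.
loosely_coupled = {"a": 0, "b": 0, "c": 1, "d": 1}
# Splitting them forces every message across the line.
tightly_coupled = {"a": 0, "b": 1, "c": 0, "d": 1}

print(cross_processor_traffic(counts, loosely_coupled))
print(cross_processor_traffic(counts, tightly_coupled))
```

Comparing the two assignments shows why a partition that keeps frequently communicating lp's on one processor lets the processors run autonomously for longer.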

We  have  not  yet  explored  the  possibility  of  deadlock  detection  by  a  global 
processor  which  continuously  observes  message  transmissions  through  the  common 
memory. Unlike the manager of the event-list in the sequential simulation of Chapter
2, this global processor remains completely passive, i.e., in the background, until it
detects deadlock. The global processor can resolve deadlock in an elegant manner: it
transmits a null message (by depositing it in the proper memory location) which
advances the edge clock value of an appropriate edge, such as zx in Example d-5.
This technique seems to be a viable alternative when simulation is attempted on
multiple  processors  sharing  a  common  memory. 

Static  partitioning  of  the  physical  network  among  a  fixed  number  of  processors 
requires  preprocessing  prior  to  simulation.  Preprocessing  is  useful  for  many  other 
reasons.  In  the  circulating  marker  algorithm,  preprocessing  is  needed  to  determine  a 
(static)  cyclic  path  for  the  marker.  Preprocessing  could  also  be  used  to  partition  the 
lp’s such that the amount of branching is reduced and cycles are mostly contained
within one processor. Preprocessing can determine other simulation parameters such
as when to time-out, sizes of buffers on edges, etc. This is an area that has been
extensively studied for sequential simulations. It needs to be studied again for
distributed  simulation  since  the  problems  are  somewhat  different  in  nature. 

We  have  sketched  several  variations  of  the  basic  scheme  for  deadlock 
resolution. There is little evidence yet of the superiority of any one scheme. The
large number of heuristics suggests that some combination may be appropriate for
particular problem domains. For instance, if we use a set of uniform processors
among which message communication is expected to be regular, we can expect that
deadlock will rarely arise and therefore a (slowly) circulating marker scheme would
be preferable. The circulating marker scheme also seems to be attractive in that it
can be used (hopefully without much overhead) in more general cases. Also, the
marker can be used to collect statistical information about the simulation itself, and
hence the simulation parameters, such as time-outs, can be dynamically changed.

We have not discussed specific architectures that can support simulation.
There  is  not  enough  experience  with  distributed  simulation  to  know  (1)  where 
distributed  simulation  spends  most  of  its  time,  and  (2)  whether  any  architectural 
improvement  would  be  uniformly  useful  for  all  problems.  At  present,  any 
architecture  that  supports  (static)  creation  of  processes  and  communications  among 
them  would  be  appropriate. 

The  circulating  marker  scheme  seems  attractive  for  hardware  implementation. 
A  hardware  marker,  analogous  to  the  token  in  a  token-ring,  could  cycle  among  the 
processes. Processes send genuine messages as before. Our requirement that messages,
including the marker, be delivered in the sequence sent along an edge can be met as
follows: when the marker is sent from lp_i to lp_j, it is given the t-component of the last
message sent by lp_i to lp_j; when the marker arrives at lp_j, it stays there until lp_j has
received a message with a t-component equal to or higher than the one that the
marker has. Otherwise the marker algorithm operates as before. The advantage of a
hardware marker is that the simulation will spend no time in overhead message
transmissions. The simplicity of the marker traversal scheme makes it feasible to
implement it in hardware.
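
The ordering rule for the marker can be sketched as follows. This is only an illustration of the hold-until-caught-up rule stated above, not of the full marker algorithm; the class and method names are assumptions introduced for the example, and message delivery is modelled simply by calling `receive_message`.

```python
from dataclasses import dataclass, field

NEVER = float("-inf")  # stands for "no message sent or received yet"


@dataclass
class Marker:
    # t-component of the last message sent on the edge the marker travels along
    t: float = NEVER


@dataclass
class LogicalProcess:
    name: str
    last_sent_t: dict = field(default_factory=dict)      # keyed by destination name
    last_received_t: dict = field(default_factory=dict)  # keyed by source name

    def send_message(self, dest, t):
        """Record that a genuine message with t-component t was sent to dest."""
        self.last_sent_t[dest.name] = t

    def receive_message(self, src, t):
        """Record delivery of a genuine message from src."""
        previous = self.last_received_t.get(src.name, NEVER)
        self.last_received_t[src.name] = max(previous, t)

    def stamp_marker(self, marker, dest):
        """Before forwarding the marker to dest, give it the last t sent to dest."""
        marker.t = self.last_sent_t.get(dest.name, NEVER)

    def may_accept_marker(self, marker, src):
        """The marker waits until a message with t >= marker.t has arrived from src."""
        return self.last_received_t.get(src.name, NEVER) >= marker.t


lp_i, lp_j = LogicalProcess("lp_i"), LogicalProcess("lp_j")
marker = Marker()
lp_i.send_message(lp_j, t=10)          # genuine message still in transit
lp_i.stamp_marker(marker, lp_j)        # marker overtakes it on the hardware ring
print(lp_j.may_accept_marker(marker, lp_i))  # marker must wait
lp_j.receive_message(lp_i, t=10)
print(lp_j.may_accept_marker(marker, lp_i))  # marker may now be processed
```

The point of the design is that the marker can travel on a separate (hardware) path and still appear to have been delivered in sequence with the genuine messages on the edge.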

We next discuss some of the glaringly open problems in distributed simulation.
The most important current problem is the empirical investigation of various
heuristics on a wide variety of problems to establish (1) which heuristics work well
for which problems and on which machine architectures, (2) how to partition the
physical system among a fixed set of processors, and (3) how to set simulation
parameters such as time-outs and buffer sizes. Some of the difficulties in
empirical  studies  are  listed  below.  First,  it  is  useful  to  have  a  distributed  architecture 
on  which  measurement  capabilities  exist,  for  implementation  of  the  distributed 
simulation  algorithm.  The  advantage  of  such  a  scheme  is  that  processor  and  line 
speeds are realistic and that the implementation is quite straightforward. Another
possibility is to first use a sequential simulator to simulate a distributed architecture
and then implement the simulation algorithm on this (simulated) distributed
architecture. One advantage is that the architecture can be continuously varied and
its effect on simulation studied. This is the approach that is currently being taken at
the University of Texas at Austin. MAY [2], a sequential process simulation language,
is  being  used  to  describe  the  distributed  architecture  and  the  distributed  simulation 
algorithm.  MAY  is  itself  a  simulation  tool  and  hence  its  statistics-gathering 
mechanisms  are  used  to  collect  and  analyze  the  behaviors  of  various  distributed 
simulation  schemes. 

A  major  disadvantage  of  this  2-tier  approach  is  the  actual  CPU  time  required 
to  run  experiments.  Not  only  does  each  experiment  take  longer,  but  the  ease  with 
which  the  parameters  of  the  experiments  can  be  changed  has  encouraged  us  to 
attempt  many  more  experiments.  A  multiprocessor  architecture,  perhaps  with  a 
common  memory,  would  provide  an  ideal  simulation  environment. 

Traditional  simulation  issues  have  not  been  addressed  in  this  monograph:  what 
data  to  collect,  how  to  collect  it  in  a  distributed  manner,  how  to  repeat  experiments 
for statistical validity (a new experiment may be started even before an older one is
completely  over),  etc.  We  feel  that  it  is  premature  to  address  these  issues  without  a 
firm  understanding  and  resolution  of  the  most  basic  issues. 

Abstract - A distributed program consists of processes, many
of which can execute concurrently on different processors in a
distributed system of processors. When several processes
from the same or different distributed programs have been
assigned to a processor in a distributed system, the processor
must select the next process to run. The following two questions
are investigated: What is an appropriate method for
selecting the next process to run? Under what conditions are
substantial gains in performance achieved by an appropriate
method of selection? Standard processor queueing disciplines,
such as first-come-first-serve and round-robin-fixed-quantum,
are studied. The results for four classes of queueing
disciplines tested on three problems are presented. These
problems were run on a testbed, consisting of a compiler and
simulator used to run distributed programs on user-specified
architectures.

1.  Introduction 

When a problem has large computational demands and
there is a network of processors available, a programmer can
utilize the computational power of many processors. The programmer
divides a problem so that pieces of the problem can
be computed in parallel. It is common to see processors connected
by local area networks. To effectively run a distributed
program on a local area network as well as other interconnection
networks, a good queueing discipline must take
into account that its processor and other processors have
pieces of the same program.

When several processes from the same or different distributed
programs have been assigned to a processor in a distributed
system, an important design question is how a processor
selects the next process to run. This problem has not
been considered in a distributed environment. An interesting
question arises: How do the processes at other processors and
communication delays in the system impact the selection of
the next process to run? As a beginning study we have investigated
the standard queueing disciplines - first-come-first-serve,
round-robin-fixed-quantum, preemptive priority, and
nonpreemptive priority - in a distributed environment. The
study shows that the response time metric can differ by 50%
with different choices of queueing disciplines for three problems.

Another important question is under what conditions
are substantial gains in performance achieved by an
appropriate method of selection. Communication delays are a
factor; thus a graph of the response time metric was plotted
as communication delays varied for each of the three problems.
Trends are observed in these graphs. A rationale for
the trends is given based on several factors.

The queueing disciplines were studied with three problems
that differ functionally and have different behavioral
characteristics. The partial differential equation solver is
based on an iterative grid technique that is similar to those
used in multidimensional applications such as weather prediction,
structural mechanics, hydrodynamics, heat transport,
and radiation transport. The centralized monitor has the
typical tree structure of hierarchically designed applications.
The producer-consumer pairs represent a multiprogramming
environment in the distributed system and each pair is
representative of a large class of problems. The different
behavioral characteristics are described in Section 5.

In Section 2 a model of the distributed architecture and
the distributed language are described. The metric for comparing
the performance of the different queueing disciplines
and a description of the testbed are given in Section 3. In
Section 4 we give a heuristic for assigning priorities for the
priority dependent queueing disciplines. Section 5 describes
the distributed programs and architectures on which each
problem executes. The results are given in Section 6. Section 7
describes the impact of queueing disciplines. In
Appendix A a more detailed description of the simulator is
given.

2.  Model  of  Distributed  Computing

2.1.  Distributed  Architecture

The distributed architecture is characterized by the
number of processors, the speed of each processor, the queueing
discipline at each processor, and the lines that connect
the processors. The lines may have different capacities,
lengths, and error rates. The processors have no shared
memory and they communicate only by messages. We
assume that any processor can communicate with any other
processor by routing messages through intermediate
processors over fixed paths.

2.2.  Distributed  Language

A program in the distributed language consists of
processes that communicate and share data by using messages.
The language is similar to CSP, which is described in
[2]. The language uses synchronous (blocking) communication
primitives; the sending process cannot proceed until the
receiving process is ready to receive the message. The two
message passing constructs are the I/O statements,
SEND(I/O variable) and RECEIVE(I/O variable). For each
corresponding pair, SEND(I/O variable) and RECEIVE(I/O
variable), that is executed, there are two nonblocking messages
sent at the protocol level that implements the language.
In this language there is a static number of processes.
Dynamic creation of processes is simulated by a process
beginning execution only after some other process sends it a
message.

2.3.  Terminology

We define virtual line time for a message between two
processors connected directly by a line as the product of the
actual time to move the message over the line and a constant
derived from line reliability and the overhead of lower level
protocols. The actual time to move the message over the line
is the usual function of message length in message units
(packets), number of bits per message unit, line capacity, and
line length. Virtual line time does not include the time a
message waits to use the communication subnet. Virtual line
time for a message between two processors is the sum of the
virtual line times for the lines on the route. Currently in local
area networks, lower level protocols executing in the processors
usually reduce the physical line capacity by at least a
factor of 10 for any message [1]. Virtual line time reflects this
effective line capacity.

The message delay of a process for a synchronous communication
as in CSP is a function of virtual line time,
queueing at the port queues on the route in a store-and-forward
network, and the processing, waiting, and queueing
time of the corresponding process at its processor. Message
delays  can  be  very  large  compared  to  a  process's  processing 
time  between  communications. 

In the testbed, 1 unit of time can be thought of as 1
µs. For local area networks where processors are 1 km apart,
transmission rates of 10 Mbit/s are common. For a packet of
250 bits it takes approximately 20 µs to send a packet over
the line. With the factor of 10 or more for lower level
protocols, 300 time units is a reasonable number for virtual
line time in this model of a local area network.
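
The arithmetic behind this estimate can be sketched as below. The function names are illustrative, and the protocol factor of 12 is an assumed value chosen only to satisfy "10 or more"; propagation delay is left as an optional parameter because the text does not give a figure for it. Under these inputs the time to clock a 250-bit packet onto a 10 Mbit/s line comes to 25 µs, the same order as the approximate figure quoted above.

```python
def transmission_time_us(packet_bits, rate_mbit_per_s):
    """Time to clock one packet onto the line, in microseconds.

    bits / (Mbit/s) gives microseconds directly.
    """
    return packet_bits / rate_mbit_per_s


def virtual_line_time(packet_bits, rate_mbit_per_s, protocol_factor,
                      propagation_us=0.0):
    """Virtual line time = (actual time over the line) * (protocol constant).

    protocol_factor stands for the constant derived from line reliability and
    lower level protocol overhead; propagation_us is an optional, assumed term
    for the dependence on line length.
    """
    actual = transmission_time_us(packet_bits, rate_mbit_per_s) + propagation_us
    return protocol_factor * actual


print(transmission_time_us(250, 10))                    # serialization time in µs
print(virtual_line_time(250, 10, protocol_factor=10))   # lower end of the range
print(virtual_line_time(250, 10, protocol_factor=12))   # assumed factor of 12
```

With one time unit taken as 1 µs, a factor between 10 and 12 places the virtual line time between 250 and 300 time units.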

An important performance factor to consider is the tradeoff
of processing time versus communication delay. To do
a study on this tradeoff we must vary either processing or
communication time. Since each process of a program has a
fixed amount of processing, we have chosen to vary the communication
time to study this tradeoff. Even though virtual
line time varies beyond what may seem reasonable for our
model, we stress that it is the ratio of processing time to communication
delay that is actually changing and the ratio is a
more meaningful factor to consider.

For each problem in this paper, we assume all the processors
have the same speed, all lines are identical, and a
message unit is 250 bits. We also assume that on any simulation
run all processors have the same queueing discipline.
These assumptions are made to isolate the effects of the
choice of queueing discipline from other system variables.

3.  Testbed  and  Metric

The metric for comparing various queueing disciplines is
defined as follows. All the processes of a distributed program
are assumed to start at time zero. Each process i terminates
at some time, t(i). The metric is the sum over N processes of
the termination times t(i) divided by N, and is termed the
average of the process termination times (APTT). APTT
reflects both the instruction processing requirements of
processes and the message delays. Total time, defined as the
maximum t(i), is not always a good metric for comparing
queueing disciplines, because when message delays are very
small, total time is comparable for all queueing disciplines.
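
As a minimal sketch of the two metrics just defined, the functions below compute APTT and total time from a list of termination times. The termination times in the example are made-up values, chosen only to show two schedules with the same total time but different APTT.

```python
def aptt(termination_times):
    """Average of the process termination times: sum of t(i) divided by N."""
    return sum(termination_times) / len(termination_times)


def total_time(termination_times):
    """Total time: the maximum t(i)."""
    return max(termination_times)


# Hypothetical termination times (in time units) under two disciplines.
schedule_a = [100, 400, 1000]   # short processes finish early
schedule_b = [600, 900, 1000]   # short processes are delayed

for name, times in (("A", schedule_a), ("B", schedule_b)):
    print(name, "APTT =", aptt(times), "total time =", total_time(times))
```

Both schedules finish at time 1000, so total time cannot tell them apart, whereas APTT rewards the schedule that lets some processes terminate sooner.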

The testbed runs distributed programs coded in the distributed
language mentioned above, which is similar to CSP.
In addition to the distributed program, the testbed also
requires a specification of the distributed architecture. The
testbed consists of a compiler, interpreter, and simulator.
The compiler produces pseudo-instructions for the hypothetical
processors in the distributed system. The interpreter executes
the pseudo-instructions. The simulator manages the
interpreter, processor queues, and port queues and executes
protocol routines. The simulator is based on the work
presented in [4] and was validated extensively using commercial
analytical and simulation packages [3, 5]. A more
detailed description of the simulator is given in Appendix A.

4.  Queueing  Disciplines 

The queueing disciplines tested were first-come-first-serve
(FCFS), round-robin-fixed-quantum (RRFQ),
nonpreemptive-priority (NPP), and preemptive-priority (PP)
[4]. The two priority disciplines NPP and PP must assign
priorities to the processes. In a PP discipline if an expected
message arrives for a blocked process of higher priority, the
blocked process preempts the currently running process. In
the following discussion we motivate and give a heuristic for
assigning priorities.

Suppose that there are three processes on a processor
ready to execute and that only one of these processes, process
1, must ever communicate across a line with another process
on a different processor. Processes 2 and 3 communicate with
each other and process 1. A good discipline will first let process
1 execute and block for communication across the line.
While process 1 is blocked, processes 2 and 3 are executed.
Hopefully, a message will arrive for process 1 and wake it up
so that it is ready to execute before processes 2 and 3 block.
A poor discipline will always execute processes 2 and 3 before
process 1. Thus, all processes are blocked until a message
arrives for process 1; the processor is idle for a longer period
of time waiting on a message. The good discipline reduces
the idle periods of the processor and thus decreases the time
when the processor finishes executing all processes. The good
discipline must also determine which of process 2 or process 3
to execute first. A good selection depends on the characteristics
of these processes.

Generally we have observed that scheduling a single
processor in a distributed architecture must be analyzed considering
both the single processor (local component) and the
distributed environment (global component). Our heuristic
for assigning priorities is given as follows:

•  Processes that communicate across a line are assigned
high priority (highest priority when message delays are
large since the global component is more important).

•  A process on which several other processes may wait (a
bottleneck process) is assigned high priority (highest
priority when message delays are small since the local
component is more important).

•  Any other processes are assigned lower priorities to
approximate shortest-remaining-time-first (SRTF) [4].

Thus a good priority discipline should generally give highest
priority to those processes communicating across a line in
order to minimize the processor idle periods and thus to
finish executing all processes at the processor sooner. The
discipline should be preemptive so that messages over the line
can be received by the corresponding process as quickly as
possible. Choosing priorities using this heuristic is demonstrated
in the problems in the next section.
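
One way to read the heuristic is as a ranking rule, sketched below. This is an illustrative interpretation rather than the procedure used to produce the priorities in the paper: the record fields (`crosses_line`, `bottleneck`, `burst`), the `large_delays` switch, and the burst values are assumptions, and priority 1.0 is taken to be the highest.

```python
def assign_priorities(processes, large_delays=True):
    """Rank the processes on one processor; 1.0 is the highest priority.

    processes: list of dicts with keys
        'id'           process number
        'crosses_line' True if the process communicates across a line
        'bottleneck'   True if several other processes may wait on it
        'burst'        estimated processing time between I/O statements
    large_delays selects whether the global component (line communication)
    or the local component (bottlenecks) is ranked first.
    """
    first, second = (("crosses_line", "bottleneck") if large_delays
                     else ("bottleneck", "crosses_line"))

    def tier(p):
        if p[first]:
            return (0, 0.0)
        if p[second]:
            return (1, 0.0)
        return (2, p["burst"])  # shorter bursts first, approximating SRTF

    ordered = sorted({tier(p) for p in processes})
    rank = {key: float(i + 1) for i, key in enumerate(ordered)}
    return {p["id"]: rank[tier(p)] for p in processes}


# One requester and three users with small, medium, and large bursts
# (burst values are hypothetical).
processor = [
    {"id": 10, "crosses_line": True, "bottleneck": True, "burst": 5},
    {"id": 1, "crosses_line": False, "bottleneck": False, "burst": 10},
    {"id": 2, "crosses_line": False, "bottleneck": False, "burst": 20},
    {"id": 3, "crosses_line": False, "bottleneck": False, "burst": 30},
]
print(assign_priorities(processor))
```

Processes with equal burst estimates fall into the same tier and share a priority, which is the situation when all remaining processes are logically equivalent.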

A  priority  discipline  with  priorities  assigned  as  described 
above  is  denoted  by  PPg  for  preemptive  priority  and  NPPg 
for  nonpreemptive  priority.  A  preemptive  priority  discipline 
with  priorities  assigned  in  such  a  way  as  not  to  follow  the 
heuristic given above is denoted by PPp; processes that communicate
across lines and bottleneck processes are assigned
lowest  priority,  and  all  the  other  processes  are  assigned 
highest  priority.  We  have  found  that  PPg  usually  does  better 
than  FCFS,  RRFQ,  PPp,  and  NPPg;  PPp  does  the  poorest. 

5.  Problems

The problems tested are a partial differential equation
solver (PDE), a centralized monitor (MONITOR), and a system
of five producer-consumer pairs (PC's). For each problem
we present a brief description of the program and a
figure that represents the distributed program, architecture,
assignment of processes to processors, and priorities for both
PPg and NPPg. Each process is represented by a circle with
the process number in the circle; the total instruction processing
time requirement per process is given below each circle.
The priority for a process is given above each circle. The
number and average size in message units of messages sent at
the program level between two communicating processes is
given above each line as the ordered pair (number, size).
Values for communication and processing time are obtained
by running the program on the testbed with any assignment
and architecture; for these programs, these quantities are
independent of the architecture and assignment. Circles
enclosed in a box mean that the enclosed processes are
assigned to one processor. For each problem the processors
are identical and the virtual line time for a message unit is
the same between pairs of processes that must communicate
over a line.

5.1.  Partial  Differential  Equation

We solve Laplace's partial differential equation on a grid
with the outer edges of the grid given as boundary conditions.
The iterative method used is Gauss-Seidel. The grid is
partitioned into subgrids where each subgrid is some number
of contiguous rows. Each subgrid is solved by a process in
the same way a sequential program would solve the entire
grid. A grid value is computed as the average of its four adjacent
neighbors; thus, to compute a row of values, the two
adjacent rows are required. Hence, a process must request
the two rows contiguous to its subgrid from its two neighboring
processes. An important property of these processes is
that they must remain closely synchronized. No process can
compute very far ahead because it requires rows that cannot
be computed unless the other processes execute.
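
The per-subgrid computation can be sketched sequentially as follows. This shows only the Gauss-Seidel update itself (each interior value replaced in place by the average of its four neighbors), not the message exchange between processes; the function names, the tolerance, and the 5 x 5 example grid are assumptions made for illustration.

```python
def gauss_seidel_sweep(grid):
    """One in-place Gauss-Seidel sweep over the interior points.

    The first and last rows and columns are treated as fixed boundary values.
    For a subgrid process these rows would be supplied by its two neighbors.
    Returns the largest change made to any value.
    """
    max_change = 0.0
    for i in range(1, len(grid) - 1):
        for j in range(1, len(grid[0]) - 1):
            new = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                          + grid[i][j - 1] + grid[i][j + 1])
            max_change = max(max_change, abs(new - grid[i][j]))
            grid[i][j] = new  # updated value is used immediately
    return max_change


def solve(grid, tol=1e-6, max_sweeps=10_000):
    """Sweep until the largest change falls below tol; return sweeps used."""
    for sweep in range(1, max_sweeps + 1):
        if gauss_seidel_sweep(grid) < tol:
            return sweep
    return max_sweeps


# 5 x 5 grid: boundary held at 1.0, interior initially 0.0.
n = 5
grid = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0
         for j in range(n)] for i in range(n)]
sweeps = solve(grid)
print(sweeps, grid[2][2])
```

Because every update to a row needs the rows above and below it, a process holding a band of rows cannot advance a sweep until its neighbors have supplied their adjacent rows, which is the close synchronization noted above.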


Figure 1 shows the structure of the problem that runs
on  two  processors.  The  two  processors  are  connected  by  a 
line  with  virtual  line  time  for  a  message  unit  set  at  £92  time 
units. In previous work we found that the assignment indicated
in Figure 1 is best for this architecture [5].

All processes are comparable; there is no bottleneck process
because each process is logically equivalent and computes
an equal number of rows. Since each process must execute
one time per Gauss-Seidel step over the same size
subgrid, there is no need to assign priorities to approximate
SRTF. The two processes that communicate over the line are
given highest priority. For PPg and NPPg, processes 3 and 4
were assigned highest priority at 1.0; the others were assigned
lower priority at 2.0. For PPp, processes 3 and 4 were
assigned lowest priority at 2.0 and the others were assigned
highest priority at 1.0. PPg performed the best of the disciplines
tested.

5.2.  Centralized  Monitor

The centralized monitor consists of a resource process
and  three  groups;  each  group  consists  of  a  requester  process 
and  its  three  user  processes.  Each  user  process  executes  some 
given  amount  of  time  and  then  makes  a  request  to  use  the 
resource  through  its  requester  process.  The  requester  process 
passes  the  user  request  on  to  the  resource  process.  This  is 
repeated  20  times  before  a  user  terminates.  The  processing 
times per iteration were chosen so that (1) there is a small,
medium, and large processing user process at each processor
and  (2)  the  sum  of  the  processing  time  of  the  users  at  each 
processor  is  approximately  the  same  at  each  processor.  An 
important  property  of  these  processes  is  that  a  user  process 
can  compute  to  termination  even  when  no  other  user  process 
has  executed.  However,  a  user  process  must  share  the 
resource  and  a  requester  process  with  other  user  processes. 

Figure 2 shows the structure of the centralized monitor
that runs on four processors. Processor 4 is connected
directly to processors 1, 2, and 3. Each line has a virtual line
time of 58 time units for a message unit. In previous work we
found that the assignment indicated in Figure 2 is best for
this architecture [5].

The requester processes are 10, 11, and 12. A requester
process has high priority because it is a bottleneck and also
because it communicates over a line. The user processes - 1
through 9 - at each processor are not identical because of
differing processing requirements. The user processes are
assigned priority using the average processing time between
I/O statements to estimate CPU bursts and thus to approximate
SRTF. For PPg and NPPg, requester processes 10, 11,
and 12 get priority 1.0; user processes 1, 4, and 7 get priority
2.0; user processes 2, 5, and 8 get priority 3.0; user processes
3, 6, and 9 get priority 4.0. For PPp, processes 10, 11, and
12 get priority 2.0, while all user processes 1-9 get priority
1.0. SRTF is an important component of the priority discipline
because a user process with a small burst time can finish
earlier than the others and thus decrease APTT. Resource
process 13 has priority 1 for each priority discipline. It is the
only process on its processor; thus the choice is arbitrary for
each priority discipline. PPg performed the best of the disciplines
tested.

5.3.  Producer-Consumer  Pairs

There are five producer-consumer pairs. Figure 3 shows
the structure of the problem that runs on two processors.
The two processors are connected by a line with virtual line
time for a message unit set at 340 time units. Processes 1 to
5 are producers; processes 6 to 10 are consumers. Each pair -
(1,6), (2,7), and (3,8) - has one-third the processing requirement
of each pair - (4,9) and (5,10). Each producer sends 40
messages to its corresponding consumer. An important property
of this problem is that each producer-consumer pair
can execute to termination independently of the other pairs.

One pair of processes communicates over the line and
both are given highest priority. There are no bottleneck
processes in this example. The two pairs with the large processing
requirements should get lower priority to approximate
SRTF. Priorities for PPg are assigned as follows: processes 3
and 8 get priority 1.0; processes 1, 6, 2, and 7 get priority 2.0;
processes 4, 9, 5, and 10 get priority 3.0. For PPp, processes
3 and 8 get priority 2.0; the other processes get priority 1.0.
Since each pair can terminate independently of the other
pairs, one process waiting on a line cannot cause all the
processes on that processor to block as can happen in the
other two problems. For this problem PPg performed the
best of the disciplines tested.

6.  Results

The  results  for  each  program  and  its  architecture  are 
given  in  Table  1.  Of  the  disciplines  tested,  PPg  is  the  best 
while  PPp  is  the  poorest.  RRFQ  always  does  better  than 
FCFS;  this  is  probably  due  to  its  preemptive  characteristic. 
The  nonpreemptive  priority  discipline,  NPPg,  is  poorer  than 
RRFQ for both the PDE and MONITOR problems. The percentage
increase in APTT from PPg to PPp as computed by
(max  APTT  -  min  APTT)  /  (min  APTT)  is  32%  for  PDE, 
49%  for  MONITOR,  and  57%  for  PC's. 
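
The percentage computation can be written out as a short sketch. The APTT values in the example are hypothetical placeholders, not the entries of Table 1; they are chosen only so that the arithmetic is easy to follow.

```python
def percent_increase(aptt_by_discipline):
    """(max APTT - min APTT) / (min APTT), expressed as a percentage."""
    worst = max(aptt_by_discipline.values())
    best = min(aptt_by_discipline.values())
    return 100.0 * (worst - best) / best


# Hypothetical APTT values (time units) for one problem.
example = {"PPg": 1000, "RRFQ": 1100, "FCFS": 1200, "PPp": 1320}
print(percent_increase(example))
```

When PPg is the best discipline and PPp the poorest, the maximum and minimum in this formula are the PPp and PPg values respectively.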

We  have  also  experimented  with  varying  the  virtual  line 
time  and  thus  the  message  delays.  The  same  assignment  of 
processes  to  processors  was  maintained.  The  graph  for  each 
problem  is  given  in  Figures  4-8.  These  graphs  show  some 
conditions  under  which  the  choice  of  queueing  discipline  has 
an  impact. 

7.  Impact  of  Processor  Queueing  Disciplines

The graphs for each problem show different trends. The
graph for the PDE shows that at small virtual line times, the
choice of queueing discipline has no impact. The graph for
the MONITOR shows that at large virtual line times, the
choice has no impact. The graph for the PC's shows that the
choice has an impact for all the virtual line times tested. In
order to explain some of the trends in the graphs generated
by this experiment, a rationale was developed. It is used to
explain the trends in the graph where virtual line time and
thus message delays vary. The rationale is a partial solution.

7.1.  Rationale

The rationale is based on two assumptions. (1) A good
heuristic for minimizing APTT is to minimize the idle time at
each processor. (2) There are two components of a discipline,
the local and global components, and they vary in their contribution
to the scheduling of a processor.

For  a  simplistic  rationale  one  can  say  two  things.  (1) 
When  virtual  line  time  is  very  small,  the  local  component  of 
the  discipline  is  more  important.  (2)  As  virtual  line  time 
increases, the global component of the discipline becomes
more important. However, this is not enough. The processing
time must be considered.

A more careful rationale compares the delay a process
can incur when waiting on a message to the remaining processing
to be done at its processor until all processes are
blocked. For a send or receive statement, the delay D(j) for
I/O variable j is the time that the process is in a wait state
for the I/O variable j. It is at most the sum of the virtual
line time to the corresponding processor, the virtual line time
from the corresponding processor, queueing time at the
appropriate port queues, and the processing, waiting, and
queueing time of the corresponding process at its processor.
When a process communicates with its corresponding process
on the same processor, the delay does not include any virtual
line time or port queueing time.
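The bound on D(j) described above can be sketched as a simple sum. This is only an illustration: the class `DelayComponents`, its field names, and the numbers in the usage note are assumptions introduced here, not part of the testbed.

```python
from dataclasses import dataclass


@dataclass
class DelayComponents:
    """Hypothetical breakdown of the terms that bound D(j)."""
    line_time_to: float      # virtual line time to the corresponding processor
    line_time_from: float    # virtual line time back from that processor
    port_queueing: float     # queueing time at the port queues on the route
    partner_time: float      # processing, waiting, and queueing time of the
                             # corresponding process at its own processor


def delay_upper_bound(c: DelayComponents, same_processor: bool = False) -> float:
    """Upper bound on the delay D(j) for one send or receive statement."""
    if same_processor:
        # No line is crossed, so line time and port queueing drop out.
        return c.partner_time
    return c.line_time_to + c.line_time_from + c.port_queueing + c.partner_time
```

With illustrative values of 300 time units per direction, 20 units of port queueing, and 100 units at the partner, the bound is 720 time units across a line and 100 time units on the same processor.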

A function is described that measures the processing
that can be overlapped when a process waits for a message.
Busytime(k,s,t) is the amount of processing time remaining at
time t until all processes are blocked at processor k, which is
scheduled by discipline s. It is a function of the problem, the
burst times between communication statements in processes,
the processor's queueing discipline, and incoming messages
from other processors. We are interested in this function
only at those times when a process enters a wait state.

Suppose a process enters a wait state at time w for I/O
variable j on processor k. If busytime(k,s,w) > D(j), then
there is no idle period for the processor for this communication.
A good global discipline should always try to maintain
this inequality for each message over a line at all processors
to avoid idle periods. If busytime(k,s,w) < D(j), then an idle
period will result from this communication. Note that an idle
period cannot happen when two processes on the same processor
are ready to communicate with each other; one process
or the other can execute. Idle periods for a processor can
only result from a communication across a line. Thus, for
each communication across a line, the ratio
D(j)/busytime(k,s,w) is defined when a process enters a wait
state at time w.
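The comparison between busytime(k,s,w) and D(j) can be illustrated with two small helper functions. The names `wait_ratio` and `idle_estimate` are hypothetical, and treating the idle period as the difference D(j) - busytime is an assumption made for the sketch; the text only states that an idle period occurs when busytime is smaller than the delay.

```python
def wait_ratio(delay: float, busytime: float) -> float:
    """Ratio D(j)/busytime(k,s,w) for one communication across a line."""
    if busytime <= 0:
        # Nothing is left to overlap with the wait.
        return float("inf")
    return delay / busytime


def idle_estimate(delay: float, busytime: float) -> float:
    """Assumed idle time: the part of the delay not covered by remaining work."""
    return max(0.0, delay - busytime)
```

For example, a delay of 300 time units against 600 units of remaining work gives a ratio of 0.5 and no idle time, while a delay of 900 against the same work leaves an estimated 300 idle time units.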

For the preemptive priority discipline, PPg, the average
R of these ratios at a processor can give us a measure of how
busy a processor is for a problem. R < 1 implies that on the
average a processor is not idle. However, processor idle
periods cannot be avoided when communication delays are
very large compared to the largest amount of processing
available at a processor under any queueing discipline.

If R << 1 then for any communication the processor was
usually busy when a process was waiting on a message. If
R >> 1 then for any communication the processor became
idle most of the time while a process was waiting on a message.
We have assumed that for any discipline there is a
local and global component. It seems reasonable that (1)
when R << 1, local scheduling is a more important component
since the processor is infrequently idle and (2) when
R >> 1, global scheduling is a more important component
since global scheduling is responsible for minimizing the idle
periods.

R was estimated but not computed; thus the following
analysis is qualitative. At each processor, R changes as virtual
line time is varied. There are many factors to consider
but generally we can say that when virtual line time increases
for all lines, each D(j) increases while each busytime(k,pp,w)
does not increase, where pp is the discipline PPg. D(j)
increases because of the larger virtual line time.
Busytime(k,pp,w) cannot increase because incoming messages
arrive later. Thus, R at each processor increases as virtual
line time increases for all lines.

So far, R has been defined for a single processor. We can
extend the idea of how the size of R relates to global and
local components of scheduling at a single processor to the
entire set of processors since R increases for all processors as
virtual line time increases for all lines. We define a measure
R* for the entire set of processors as the average of all R.
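As an illustration of the two averages, the sketch below computes R for each processor and R* over all processors from lists of observed ratios. The function names and the thresholds 0.1 and 10 used to stand in for "much smaller" and "much larger" than 1 are assumptions chosen for the example; the analysis here is qualitative and fixes no such thresholds.

```python
from statistics import mean


def processor_r(ratios):
    """R: average of the D(j)/busytime ratios observed at one processor."""
    return mean(ratios)


def system_r_star(ratios_by_processor):
    """R*: average of R over the entire set of processors."""
    return mean([processor_r(r) for r in ratios_by_processor.values()])


def dominant_component(r_star, low=0.1, high=10.0):
    """Classify which scheduling component matters most (thresholds assumed)."""
    if r_star < low:
        return "local"
    if r_star > high:
        return "global"
    return "both"
```

With ratios {1: [0.5, 1.5], 2: [2.0, 4.0]}, the per-processor averages are 1.0 and 3.0, so R* is 2.0 and both components would be considered important.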

The graphs in Figures 4-6 plot APTT versus virtual
line time. For the PDE and centralized monitor when virtual
line time varies over a reasonable range of values, we have
estimated that R* varies from much smaller to much larger
than 1. For the producer-consumer pairs program, R* could
not be forced to vary because of the disconnected communication
structure. Each producer-consumer pair assigned to
the same processor can execute until termination. Thus at a
processor k, busytime(k,pp,w) is very large when the one process
that must communicate over the line enters a wait state
at time w. There are always two producer-consumer pairs to
execute until their termination.

To analyze the trends in these graphs, we look at d,
which is defined as the difference of the maximum APTT and
the minimum APTT at a given virtual line time. The
minimum APTT should correspond to a good discipline,
APTTg, while the maximum APTT should correspond to a
poor discipline, APTTp. Thus, d should give us a bound on
how queueing disciplines can impact APTT. If d is small, the
choice of queueing discipline has no impact because all disciplines
produce approximately the same metric value. If d is
large, the choice has an impact on performance because the
good discipline and poor discipline produce metric values
that are not close.
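The quantity d, together with the percentage increase (max APTT - min APTT) / (min APTT) used in the results, can be computed as follows. The APTT values in the usage note are invented for illustration and are not measurements from the testbed.

```python
def spread(aptt_by_discipline):
    """d: maximum APTT minus minimum APTT at one virtual line time."""
    values = aptt_by_discipline.values()
    return max(values) - min(values)


def percent_increase(aptt_by_discipline):
    """(max APTT - min APTT) / (min APTT), expressed as a percentage."""
    lowest = min(aptt_by_discipline.values())
    return 100.0 * spread(aptt_by_discipline) / lowest
```

For hypothetical values {"PPg": 100, "FCFS": 120, "PPp": 132}, d is 32 time units and the percentage increase from the best to the poorest discipline is 32%.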

Different trends are observed in Figures 4-6. A
rationale to explain these trends is:

• For R* << 1, d can be large or small. The local component
of scheduling is more important. If all processes
are comparable (no bottleneck processes and each process
has approximately the same processing burst), then
all disciplines are comparable and d is small. If the
processes are different, then the discipline can make a
difference and d is large.

• For R* ≈ 1, both the local and global components are
important. The size of d depends on how the problem
responds to the components.

• For R* >> 1, d can be large or small, and d -> constant.
The global component of scheduling is more
important. Each processor is mostly idle until a message
arrives. If only one process is ready at a CPU queue at
a time and the processes order themselves, then d is 0
for large enough delays. This is the case for the centralized
monitor. If all the processes become ready at a
CPU shortly after a message arrives for a process L, then
running L is important because it communicates across a
line; d is the time when L is ready to run but the other
processes are scheduled ahead of L. This time is constant
for large enough delays, and thus d is a constant
and can be large. This is the case for the PDE.

7.2. Discussion of Graphs

Each graph plots APTT as a function of virtual line
time, where virtual line time varies from 50 to at most 1200
time units. Experiments were conducted outside this domain
but were not plotted because no additional information was
provided. Trends established at the endpoints continued
beyond the interval plotted.


For the PDE, each process is logically comparable and
each process works on the same size subgrid. For R* << 1,
the order in which processes are executed is not very important;
thus there is little difference in the queueing disciplines
and d is small. For R* >> 1, all processes at a processor
block because they are closely synchronized and cannot
proceed until the process waiting on a message across the
line receives the necessary row. In Figure 1 for processor 2,
this is process 4. Process 4 computes the points in the middle
of its subgrid and then communicates with its neighbor process
5 on the same processor. At this point (1) process 4 is
ready to begin the next iteration step and communicate with
its neighbor process 3 on the other processor again and (2)
process 5 and in turn its neighbor 6 are ready to run. d is the
difference due to executing process 4 first or last; d depends
on whether or not the disciplines overlap waits and processing.
For PDE, d is large.

For the MONITOR, the three requester processes and
the resource process are bottleneck processes. The user
processes have different processing bursts. For R* << 1, the
choice of discipline has an impact and d is large. For
R* >> 1, the user processes are all blocked most of the time
waiting on the resource to get and process their requests.
When a message arrives for a user process U, only U is
unblocked since all the user processes are independent. U
executes and sends a message to the requester without interruption.
Since only one process at a time is on the CPU queue,
the queueing discipline never has to make a choice; thus d is
small.

For the PC's, the choice of discipline has an impact over
the virtual line times tested. R* does not vary over a large
range for the virtual line times tested because of the disconnected
structure of the problem; there is always a process
ready to run. This keeps busytime large relative to the virtual
line times tested; thus R* << 1. Since pairs differ in
their processing bursts it is important to approximate SRTF;
d is large and thus the choice of discipline has an impact.

8.  Conclusion 

We have presented the results for five queueing disciplines
tested on three problems. The disciplines tested are
first-come-first-serve, round-robin-fixed-quantum,
nonpreemptive-priority, and preemptive-priority with two sets
of priorities. A heuristic is given to assign priorities. We
found that the preemptive priority discipline with priorities
assigned according to our heuristic was the best discipline
tested. We also found that the choice of queueing discipline
varied in its impact on performance. A rationale is given to
predict when the choice of discipline has the most impact.

Appendix A

The testbed consists of a compiler and a simulator. The
simulator includes operating system routines, network protocol
routines, and an interpreter. The compiler produces
Pcode instructions (instructions for the hypothetical processor)
for each process. The simulator has two types of events:

• Interpret Pcode instructions for processes on processor i
until time for the next event or until processor i has no
processes to execute. We refer to this event as run processor i.

• Message arrival at processor i.

Initially, there are no message arrival events, and all processors
i that have processes to execute are represented by the
event, run processor i. If several messages arrive at a processor
at the same time, the messages are handled FCFS
depending on the simulator's event list. If all processors are
the same speed and Pcode execution time is the same for
most instructions, then a run processor event will be the execution
of exactly one Pcode instruction at that processor.
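The event list with its two event types can be sketched with a priority queue ordered by time. This is a minimal illustration, not the testbed's implementation: the class `EventList`, the constants `RUN` and `ARRIVAL`, and the use of a sequence counter to obtain FCFS ordering among simultaneous events are all assumptions.

```python
import heapq
from itertools import count

RUN = "run processor"          # interpret Pcode at a processor
ARRIVAL = "message arrival"    # a message arrives at a processor


class EventList:
    """Time-ordered event list; ties are broken first come, first served."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # insertion order decides among equal times

    def schedule(self, time, kind, processor):
        heapq.heappush(self._heap, (time, next(self._seq), kind, processor))

    def pop(self):
        time, _, kind, processor = heapq.heappop(self._heap)
        return time, kind, processor

    def __len__(self):
        return len(self._heap)


def initial_events(processors_with_work):
    """Initially there are only run-processor events, one per busy processor."""
    events = EventList()
    for p in processors_with_work:
        events.schedule(0, RUN, p)
    return events
```

Two messages scheduled for the same time are returned in the order in which they were placed on the list, which mirrors the FCFS handling described above.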

The network architecture of the testbed is based on the
conventional ISO OSI reference model. We simulated enough
layers to give a detailed model of distributed computing
without actually building a system. We simulated the
language layer (application), transport, and a simplified network
layer. Below the network layer, the testbed assumes
error-free full-duplex lines. This assumption is not quite as
strict as it seems. The actual line time can be increased by a
random number to approximate the time for protocol execution
and lower level messages in the data link and physical
layers. We defined this in Section 2.3 as the virtual line time.

The language layer at a processor provides the buffers
for the messages that arrive at and whose destination is that
processor. These message arrivals are passed directly from
the network layer to the language layer, where an uninterruptible
language layer protocol routine is executed.

The testbed was validated extensively using commercial
analytical and simulation packages. The commercial simulation
package was used to model several problems and architectures
to validate detailed aspects of the simulator. The
analytical package was used to model higher level aspects of
the testbed.

The  testbed  provides  confidence  interval  estimates  at 
the  00%  level  with  relative  widths  less  than  0.05  for  various 
performance  measures.  In  this  paper  we  have  reported  only 
the midpoint of the confidence interval for the measure,
APTT [5].

Abstract - A distributed program consists of processes, many of
which can execute concurrently on different processors in a distributed
system of processors. When several processes from the same
or different distributed programs have been assigned to a processor
in a distributed system, the processor must select the next process
to run. The question investigated is: What is an appropriate
method for selecting the next process to run? Standard processor
queueing disciplines, such as first-come-first-serve and
round-robin-fixed-quantum, are studied. The results for four classes of
queueing disciplines tested on three problems are presented.
These problems were run on a testbed, consisting of a compiler
and simulator used to run distributed programs on user-specified
architectures.

1.  Introduction 

When several processes from the same or different distributed
programs have been assigned to a processor in a distributed
system, an important design question is how a processor selects
the next process to run. This problem has not been considered in
a distributed environment. An interesting question arises: How do
the processes at other processors and communication delays in the
system impact the selection of the next process to run? As a
beginning study we have investigated the standard queueing disciplines
- first-come-first-serve, round-robin-fixed-quantum,
preemptive priority, and nonpreemptive priority - in a distributed
environment. The study shows that the response time metric can
differ by 50% with different choices of queueing disciplines for
three problems.

The queueing disciplines were studied with several problems
that represent three important classes of problems. The partial
differential equation solver is based on an iterative grid technique
that is similar to those used in multidimensional applications such
as weather prediction, structural mechanics, hydrodynamics, heat
transport, and radiation transport. The centralized monitor has
the typical tree structure of hierarchically designed applications.
The producer-consumer pairs represent a multiprogramming
environment in the distributed system and are representative of a
large class of problems.

In Section 2 a model of the distributed architecture and the
distributed language are described. The metric for comparing the
performance of the different queueing disciplines and a description
of the testbed are given in Section 3. In Section 4 we give a
heuristic for assigning priorities for the priority dependent queueing
disciplines. Section 5 describes the distributed programs and
architectures on which each problem executes. The results are
given in Section 6.

2. Model of Distributed Computing

2.1. Distributed Architecture

The distributed architecture is characterized by the number
of processors, the speed of each processor, the queueing discipline
at each processor, and the lines that connect the processors. The
lines may have different capacities, lengths, and error rates. The
processors have no shared memory and they communicate only by
messages. We assume that any processor can communicate with
any other processor by routing messages through intermediate
processors over fixed paths.

*Present address: Compiler Systems Group, C-l, Los Alamos National
Laboratory, Los Alamos, New Mexico 87544


2.2. Distributed Language

A program in the distributed language consists of processes
that communicate and share data by using messages. The
language is similar to CSP, which is described in [2]. The
language uses synchronous (blocking) communication primitives;
the sending process cannot proceed until the receiving process is
ready to receive the message. For each message sent at the program
level, there are two messages sent at the protocol level that
implements the language. In this language there is a static
number of processes. Dynamic creation of processes is simulated
by a process beginning execution only after some other process
sends it a message.

2.3. Terminology

We define virtual line time for a message between two processors
connected directly by a line as the product of the actual
time to move the message over the line and a constant derived
from line reliability and the overhead of lower level protocols. The
actual time to move the message over the line is the usual function
of message length in message units (packets), number of bits per
message unit, line capacity, and line length. Virtual line time does
not include the time a message waits to use the communication
subnet. Virtual line time for a message between two processors is
the sum of the virtual line times for the lines on the route.
Currently in local area networks, lower level protocols executing in
the processors usually reduce the physical line capacity by at least
a factor of 10 for any message [1]. Virtual line time reflects this
effective line capacity.

The message delay of a process for a synchronous communication
as in CSP is a function of virtual line time, queueing at the
port queues on the route in a store and forward network, and the
processing, waiting, and queueing time of the corresponding process
at its processor. Message delays can be very large compared
to a process's processing time between communications.

In the testbed 1 unit of time can be thought of as 1 µs.
For local area networks where processors are 1 km apart, transmission
rates of 10 Mbit/s are common. For a packet of 256 bits it
takes approximately 20 µs to send a packet over the line. With
the factor of 10 or more for lower level protocols, 300 time units is
a reasonable number for virtual line time in this model of a local
area network.
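The arithmetic behind these figures can be sketched as follows. The functions are illustrative assumptions: they model only the raw transmission time of a packet (bits divided by line rate) and the multiplicative protocol factor, ignoring line length and reliability, so the result lands near, rather than exactly on, the values quoted above.

```python
def transmission_time_us(bits: float, rate_mbit_per_s: float) -> float:
    """Raw time in microseconds to clock the bits onto the line.

    One Mbit/s moves one bit per microsecond, so bits / rate is in µs.
    """
    return bits / rate_mbit_per_s


def virtual_line_time(actual_time: float, protocol_factor: float) -> float:
    """Actual line time scaled by a constant for lower level protocols."""
    return actual_time * protocol_factor


def route_virtual_line_time(line_times) -> float:
    """Virtual line time over a route: the sum over the lines on the route."""
    return sum(line_times)
```

Under these simplifications a 256-bit packet at 10 Mbit/s takes 25.6 µs, which a protocol factor of 10 scales to 256 time units, the same order of magnitude as the 300 time units adopted in the model. A route over two such lines at 300 time units each would have a virtual line time of 600 time units.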

For each problem in this paper, we assume all the processors
have the same speed, all lines are identical, and a message unit is
256 bits. We also assume that on any simulation run all processors
have the same queueing discipline. These assumptions are
made to isolate the effects of the choice of queueing discipline from
other system variables.

3. Testbed and Metric

The metric for comparing various queueing disciplines is
defined as follows. All the processes of a distributed program are
assumed to start at time zero. Each process i terminates at some
time t(i). The metric is the sum over N processes of the termination
times t(i) divided by N, and is termed the average of the process
termination times (APTT). APTT reflects both the instruction
processing requirements of processes and the message delays.
Total time, defined as the maximum t(i), is not always a good
metric for comparing queueing disciplines, because when message
delays are very small, total time is comparable for all queueing
disciplines.
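The two metrics can be written down directly from their definitions. The termination times in the usage note are invented for illustration; they show how two schedules can share the same total time yet differ in APTT.

```python
def aptt(termination_times):
    """Average of the process termination times: sum of t(i) divided by N."""
    return sum(termination_times) / len(termination_times)


def total_time(termination_times):
    """Total time: the maximum t(i) over all processes."""
    return max(termination_times)
```

For hypothetical termination times [100, 200, 600] and [500, 550, 600], total time is 600 in both cases, but APTT is 300 for the first schedule and 550 for the second, so only APTT separates them.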

The testbed runs distributed programs coded in the distributed
language mentioned above, which is similar to CSP. In
addition to the distributed program, the testbed also requires a
specification of the distributed architecture. The testbed consists
of a compiler, interpreter, and simulator. The compiler produces
pseudo-instructions for the hypothetical processors in the distributed
system. The interpreter executes the pseudo-instructions.
The simulator manages the interpreter, processor queues, and port
queues and executes protocol routines. The simulator is based on
the work presented in [t| and was validated extensively using commercial
analytical and simulation packages [3,5].

4.  Queueing  Disciplines 

The queueing disciplines tested were first-come-first-serve
(FCFS), round-robin-fixed-quantum (RRFQ), nonpreemptive-priority
(NPP), and preemptive-priority (PP) (t|. The two priority
disciplines NPP and PP must assign priorities to the processes. In
a PP discipline, if an expected message arrives for a blocked process
of higher priority, the blocked process preempts the currently
running process. In the following discussion we give a heuristic for
assigning priorities.
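How the disciplines differ in choosing the next process can be sketched as below. This is a simplified illustration under stated assumptions: each ready process is a dictionary with hypothetical keys `priority` and `enqueued`, a smaller priority number means higher priority (the convention used in the problem descriptions, where highest priority is 1.0), and the quantum handling of RRFQ is left out, so RRFQ selects like FCFS here.

```python
def select_next(ready, discipline):
    """Pick the next process to run from a non-empty ready queue."""
    if discipline in ("FCFS", "RRFQ"):
        # Oldest entry first; RRFQ's quantum expiry is not modeled.
        return min(ready, key=lambda p: p["enqueued"])
    if discipline in ("NPP", "PP"):
        # Highest priority first (smallest number), FCFS among equals.
        return min(ready, key=lambda p: (p["priority"], p["enqueued"]))
    raise ValueError(f"unknown discipline: {discipline}")


def preempts(discipline, running, unblocked):
    """Only PP lets a newly unblocked higher-priority process take over."""
    return discipline == "PP" and unblocked["priority"] < running["priority"]
```

Given a queue holding process A (priority 2.0, enqueued first) and process B (priority 1.0, enqueued second), FCFS selects A while either priority discipline selects B; if B becomes unblocked while A is running, it preempts A only under PP.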

Generally we have observed that scheduling a single processor
in a distributed architecture must be analyzed considering
both the single processor (local component) and the distributed
environment (global component). Our heuristic for assigning
priorities is given as follows:

• Processes that communicate across a line are assigned high
priority (highest priority when message delays are large since
the global component is more important).

• A process on which several other processes may wait (a
bottleneck process) is assigned high priority (highest priority
when message delays are small since the local component is
more important).

• Any other processes are assigned lower priorities to approximate
shortest-remaining-time-first (SRTF) [4].

Thus a good priority discipline should generally give highest priority
to those processes communicating across a line in order to
minimize the processor idle periods and thus to finish executing all
processes at the processor sooner. The discipline should be
preemptive so that messages over the line can be received by the
corresponding process as quickly as possible. Choosing priorities
using this heuristic is demonstrated in the problems in the next
section.
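One possible reading of the heuristic is sketched below. It is a simplification made for illustration: processes that cross a line and bottleneck processes are placed together in the top class at 1.0 (the sketch does not distinguish "high" from "highest" according to message delay), and the remaining processes are ranked by processing burst from 2.0 upward to approximate SRTF. The function name and the tuple layout are assumptions.

```python
def assign_priorities(processes):
    """Map each process to a priority number; 1.0 is the highest priority.

    processes: dict of name -> (crosses_line, is_bottleneck, burst)
    """
    # Distinct burst sizes of the ordinary processes, shortest first.
    bursts = sorted({burst for line, neck, burst in processes.values()
                     if not (line or neck)})
    rank = {burst: 2.0 + i for i, burst in enumerate(bursts)}

    priorities = {}
    for name, (line, neck, burst) in processes.items():
        if line or neck:
            priorities[name] = 1.0          # line or bottleneck process
        else:
            priorities[name] = rank[burst]  # shorter burst, higher priority
    return priorities
```

For a hypothetical processor holding one process that communicates over the line, one short-burst process, and one long-burst process, the sketch yields priorities 1.0, 2.0, and 3.0 respectively.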

A priority discipline with priorities assigned as described
above is denoted by PPg for preemptive priority and NPPg for
nonpreemptive priority. A preemptive priority discipline with
priorities assigned in such a way as not to follow the heuristic
given above is denoted by PPp: processes that communicate across
lines and bottleneck processes are assigned lowest priority, and all
the other processes are assigned highest priority. We have found
that PPg usually does better than FCFS, RRFQ, PPp, and NPPg;
PPp does the poorest.

5. Problems

The problems tested are a partial differential equation solver
(PDE), a centralized monitor (MONITOR), and a system of five
producer-consumer pairs (PC's). For each problem we present a
brief description of the program and a figure that represents the
distributed program, architecture, assignment of processes to processors,
and priorities for both PPg and NPPg. Each process is
represented by a circle with the process number in the circle; the
total instruction processing time requirement per process is given
below each circle. The priority for a process is given above each
circle. The number and average size in message units of messages
sent at the program level between two communicating processes is
given above each line as the ordered pair (number, size). Values
for communication and processing time are obtained by running
the program on the testbed with any assignment and architecture;
for these programs these quantities are independent of the architecture
and assignment. Circles enclosed in a box mean that the
enclosed processes are assigned to one processor. For each problem
the processors are identical and the virtual line time for a
message unit is the same between pairs of processes that must
communicate over a line.

5.1. Partial Differential Equation

We solve Laplace's partial differential equation (PDE) on a
grid with the outer edges of the grid given as boundary conditions.
The iterative method used is Gauss-Seidel. The grid is partitioned
into subgrids where each subgrid is some number of contiguous
rows. Each subgrid is solved by a process in the same way a
sequential program would solve the entire grid. A grid value is
computed as the average of its four adjacent neighbors; thus, to
compute a row of values, the two adjacent rows are required.
Hence, a process must request the two rows contiguous to its
subgrid from its two neighboring processes.
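The update rule and the row partitioning can be illustrated with a sequential sketch. It is not the distributed program run on the testbed: the function names are hypothetical, the grid is a plain list of lists whose outer edge holds the boundary values, and the exchange of rows between neighboring processes is not modeled.

```python
def gauss_seidel_sweep(grid):
    """One in-place Gauss-Seidel sweep over the interior points.

    Each interior value becomes the average of its four adjacent
    neighbors; the largest change in the sweep is returned.
    """
    max_change = 0.0
    for i in range(1, len(grid) - 1):
        for j in range(1, len(grid[i]) - 1):
            new = (grid[i - 1][j] + grid[i + 1][j]
                   + grid[i][j - 1] + grid[i][j + 1]) / 4.0
            max_change = max(max_change, abs(new - grid[i][j]))
            grid[i][j] = new
    return max_change


def partition_rows(first, last, parts):
    """Split rows first..last into contiguous blocks, one per process."""
    rows = list(range(first, last + 1))
    size, extra = divmod(len(rows), parts)
    blocks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        blocks.append(rows[start:end])
        start = end
    return blocks
```

A process that owns rows 3 and 4 of such a partition needs rows 2 and 5 from its neighbors before it can update its own rows, which is the source of the communication described above.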

Figure 1 shows the structure of the problem that runs on
two processors. The two processors are connected by a line with
virtual line time for a message unit set at time units. In previous
work we found that the assignment indicated in Figure 1 is
best for this architecture [5].

All processes are comparable; there is no bottleneck process
because each process is logically equivalent and computes an equal
number of rows. Since each process must execute one time per
Gauss-Seidel step over the same size subgrid, there is no need to
assign priorities to approximate SRTF. The two processes that
communicate over the line are given highest priority. For PPg
and NPPg, processes 3 and 4 were assigned highest priority at 1.0;
the others were assigned lower priority at 2.0. For PPp, processes
3 and 4 were assigned lowest priority at 2.0 and the others were
assigned highest priority at 1.0.

5.2. Centralized Monitor

The centralized monitor consists of a resource process and
three groups; each group consists of a requester process and its
three user processes. Each user process executes some given
amount of time and then makes a request to use the resource
through its requester process. The requester process passes the
user request on to the resource process. This is repeated 20 times
before a user terminates. The processing times per iteration were
chosen so that (1) there is a small, medium, and large processing
user process at each processor and (2) the sum of the processing
time of the users at each processor is approximately the same at
each processor.

Figure 2 shows the structure of the centralized monitor that
runs on four processors. Processor 4 is connected directly to processors
1, 2, and 3. Each line has a virtual line time of 58 time
units for a message unit. In previous work we found that the
assignment indicated in Figure 2 is best for this architecture [5].

The requester processes are 10, 11, and 12. A requester process
has high priority because it is a bottleneck and also because it
communicates over a line. The user processes - 1 through 9 - at
each processor are not identical because of differing processing
requirements. The user processes are assigned priority using the
average processing time between I/O statements to estimate CPU
bursts and thus to approximate SRTF. For PPg and NPPg,
requester processes 10, 11, and 12 get priority 1.0; user processes 1,
4, and 7 get priority 2.0; user processes 2, 5, and 8 get priority 3.0;
user processes 3, 6, and 9 get priority 4.0. For PPp, processes 10,
11, and 12 get priority 2.0, while all user processes 1 - 9 get priority
1.0. SRTF is an important component of the priority discipline
because a user process with a small burst time can finish earlier
than the others and thus decrease APTT.

5.3. Producer-Consumer Pairs

There are five producer-consumer pairs. Figure 3 shows the
structure of the problem that runs on two processors. The two
processors are connected by a line with virtual line time for a message
unit set at 300 time units. Processes 1 to 5 are producers;
processes 6 to 10 are consumers. Each pair - (1,6), (2,7), and (3,8) -
has one-third the processing requirement of each pair - (4,9) and
(5,10). Each producer sends 40 messages to its corresponding
consumer.

One pair of processes communicates over the line and both
are given highest priority. There are no bottleneck processes in
this example. The two pairs with the large processing requirements
should get lower priority to approximate SRTF. Priorities
for PPg are assigned as follows: processes 3 and 8 get priority 1.0;
processes 1, 6, 2, and 7 get priority 2.0; processes 4, 9, 5, and 10
get priority 3.0. For PPp, processes 3 and 8 get priority 2.0; the
other processes get priority 1.0. The producer-consumer pairs that
are not split across two processors are independent of each other.
These pairs can terminate independently of the other pairs; one
process waiting on a line cannot cause all the processes on that
processor to block as can happen in the other two problems.

6.  Results

The results for each program and its architecture are given in Table 1. Of the
disciplines tested, PPg is the best while PPp is the poorest. RRFQ always does
better than FCFS; this is probably due to its preemptive characteristic. The
nonpreemptive priority discipline, NPPg, is poorer than RRFQ for both the PDE
and MONITOR problems. The percentage increase in APTT from PPg to PPp, as
computed by (max APTT - min APTT) / (min APTT), is 32% for PDE, 49% for
MONITOR, and S7% for PC.
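
The percentage-increase measure quoted above can be sketched as follows. This is a minimal illustration only: the function name and the sample APTT values are assumptions made for the example and are not taken from Table 1.

```python
def aptt_percent_increase(aptt_values):
    """Percentage increase from the smallest to the largest APTT.

    Implements (max APTT - min APTT) / (min APTT), expressed in percent.
    """
    best, worst = min(aptt_values), max(aptt_values)
    return 100.0 * (worst - best) / best


# Hypothetical APTT values (arbitrary time units) for two disciplines.
example = {"PPg": 100.0, "PPp": 132.0}
print(round(aptt_percent_increase(example.values()), 1))
```

With these made-up inputs the worst discipline is 32% slower than the best one, which is the form in which the figures in this section are reported.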

7.  Conclusion 

We have presented the results for five queueing disciplines tested on three
problems. The disciplines tested are first-come-first-serve,
round-robin-fixed-quantum, nonpreemptive-priority, and preemptive-priority with
two sets of priorities. A heuristic is given to assign priorities. We found
that the preemptive priority discipline with priorities assigned according to
our heuristic was the best discipline tested.

The Drinking Philosophers Problem

K.  M.  CHANDY  and  J.  MISRA 
University  of  Texas  at  Austin 


The problem of resolving conflicts between processes in distributed systems is
of practical importance. A conflict between a set of processes must be resolved
in favor of some (usually one) process and against the others: a favored
process must have some property that distinguishes it from others. To guarantee
fairness, the distinguishing property must be such that the process selected
for favorable treatment is not always the same. A distributed implementation of
an acyclic precedence graph, in which the depth of a process (the longest chain
of predecessors) is a distinguishing property, is presented. A simple conflict
resolution rule coupled with the acyclic graph ensures fair resolution of all
conflicts. To make the problem concrete, two paradigms are presented: the
well-known distributed dining philosophers problem and a generalization of it,
the distributed drinking philosophers problem.

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent
Programming; D.4.1 [Operating Systems]: Process Management - concurrency,
mutual exclusion, synchronization; D.4.7 [Operating Systems]: Organization and
Design - distributed systems

General Terms: Algorithms

Additional  Key  Words  and  Phrases:  Asymmetry,  dining  philosophers  problem 


1.  INTRODUCTION 

We  study  the  problem  of  fair  conflict  resolution  in  distributed  systems.  Conflicts 
can  be  resolved  only  if  there  is  some  property  by  which  one  process  in  every  set 
of  conflicting  processes  can  be  distinguished  and  selected  for  favorable  treatment; 
that  is,  a  conflict  is  resolved  in  favor  of  the  distinguished  process.  In  order  to 
guarantee  fairness,  the  distinguishing  property  must  be  such  that  the  process 
selected  for  favorable  treatment  is  not  always  the  same.  Traditional  schemes  for 
fair conflict resolution use priorities assigned to processes [2, 3, 7, 9, 10]
or probabilistic selection [5, 8]. We propose a new approach by using the
locations of shared resources as a distinguishing property. By introducing
auxiliary resources, where needed, and by judiciously transferring resources
among processes, we show that all conflicts can be resolved fairly. We propose
a paradigm, the drinking philosophers problem, which captures the essence of
conflict resolution problems in distributed systems. This problem is a
generalization of the classical dining philosophers problem [2, 3]. We present
both problems formally in the following sections. This section presents an
informal introduction to the problem of conflict resolution in distributed
systems.

Two or more processes cannot execute certain actions simultaneously: for
instance,  two  processes  cannot  hold  “write  locks”  on  the  same  data  item  at  the 
same  time.  Conflicts  arise  when  two  or  more  processes  attempt  to  carry  out  such 
actions  simultaneously.  The  resolution  of  such  a  conflict  requires  that  some 
processes  be  treated  differently  from  others  in  the  sense  that  the  conflict  be 
resolved in favor of some processes and against other conflicting processes. If all
processes  in  a  set  of  conflicting  processes  are  indistinguishable  (i.e.,  if  every 
property  that  holds  for  one  process  also  holds  for  the  others),  then  it  is  impossible 
to  resolve  conflicts  between  them  without  resorting  to  random  selection.  This  is 
because  any  deterministic  algorithm  that  selects  one  of  the  processes  for  favorable 
treatment  must  carry  out  the  selection  on  the  basis  of  some  property  that  holds 
for  that  process  and  not  for  the  others.  Therefore,  if  we  do  not  wish  to  use 
probabilistic algorithms to resolve conflicts, we must ensure the following
invariant:

Distinguishability. In every state of the system at least one process in every
set  of  conflicting  processes  must  be  distinguishable  from  the  other  processes  of 
the  set. 

An  example  of  a  distinguishing  property  is  a  process’s  identity  number  (pro¬ 
vided  that  it  is  different  from  the  identity  numbers  of  all  processes  that  it  may 
conflict  with). 

Fairness.  Usually  we  require  not  only  that  conflicts  be  resolved  but  also  that 
they  be  resolved  fairly,  that  is,  conflicts  should  not  always  be  resolved  to  the 
detriment  of  a  particular  process.  If  conflicts  always  occur  in  the  same  system 
state,  a  deterministic  conflict  resolution  scheme  will  always  resolve  conflicts  in 
the  same  way  because  the  outcome  of  a  deterministic  scheme  is  a  function  of  the 
system state. In this case conflict resolution will be unfair. Fairness requires that
the  states  that  obtain  when  conflicts  occur  not  always  be  identical.  An  example 
of  state  information  used  to  ensure  that  conflicts  arise  in  different  system  states 
is  time,  where  time  may  be  determined  by  a  centralized,  global  clock  or  by 
distributed logical clocks [7]: every request (which may result in a conflict) is
timestamped, and a conflict between two requests is resolved in favor of the one
with  the  smaller  timestamp.  However,  conflicts  between  processes  with  equal 
timestamps  must  be  resolved  by  using  some  other  distinguishing  property  (such 
as  process  IDs).  The  state  information  used  to  ensure  fairness  may  reside  in  a 
single  process  (the  centralized  solution)  or  it  may  be  distributed.  This  paper  is 
about  distributed  schemes  to  ensure  (1)  distinguishability  and  (2)  fairness. 
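
The timestamp scheme just described, with process IDs as the tie-breaking property, can be sketched as below. The representation of a request as a (timestamp, process ID) pair and the function name are assumptions made for illustration.

```python
def favored_request(request_a, request_b):
    """Return the request that wins a conflict.

    Each request is a (timestamp, process_id) pair. The smaller timestamp
    wins; equal timestamps are separated by the smaller process ID, which
    relies on process IDs being distinct among conflicting processes.
    """
    # Tuples compare lexicographically: timestamp first, then process ID.
    return min(request_a, request_b)


print(favored_request((7, 3), (5, 9)))  # earlier timestamp wins
print(favored_request((5, 3), (5, 9)))  # tie on time, process ID decides
```

Because timestamps grow over time, a process that loses a conflict holds an older request in the next round, which is what keeps this deterministic rule from always favoring the same process.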

We describe our problem informally by using a graph model of conflict. A
distributed  system  is  represented  by  an  undirected  graph  G  with  a  one-to-one 
correspondence  between  vertices  in  G  and  processes  in  the  system.  Edge  (u,  v) 
exists  in  G  if  and  only  if  there  may  be  a  conflict  between  (the  processes 
corresponding  to)  vertices  u  and  v.  We  assume  that  there  is  some  mechanism 
that, in every state of the system, ascribes a precedence ordering to every
pair of potentially conflicting processes so that one of the processes in the
pair has precedence over the other. If there is a conflict between a pair of
processes, the process with the lower precedence must yield to the process with
greater precedence in finite time. We represent precedences between pairs of
potentially conflicting processes by a precedence graph H, which is a graph
identical to G except that each edge in G is given a direction in H as follows:
An edge in H is directed away from the process with greater precedence toward
the process with lesser precedence. For example, Figure 1 shows graph G for a
system with 3 processes p, q, and r with the possibility of conflict between
any pair of processes. Figure 2 shows graph H for a state of the system in
which p has precedence over q and r, and q has precedence over r.

If  H  is  acyclic,  then  the  depth  of  a  process  in  H  is  a  distinguishing  property  by 
which  a  process  can  be  distinguished  from  all  processes  that  it  may  conflict  with; 
depth  of  a  process  p  in  H  is  the  maximum  number  of  edges  on  any  (directed) 
path  to  p  from  a  process  without  any  predecessors.  Note  that  a  process  with  no 
predecessor  has  depth  0.  It  follows  that  neighbors  cannot  have  the  same  depth. 
For  example,  in  Figure  2,  the  depth  of  p,  q,  and  r  are  0,  1,  and  2,  respectively. 
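
The definition of depth can be checked with a short sketch. The edge-list encoding, in which a pair (u, v) means that u has precedence over v, and the function name are assumptions for the example; the code presumes H is acyclic and would recurse without end otherwise.

```python
from functools import lru_cache


def depths(edges):
    """Map each process to its depth in an acyclic precedence graph.

    edges: iterable of (u, v) pairs, read as "u has precedence over v".
    Depth is the longest directed path from any process with no predecessor.
    """
    preds = {}
    for u, v in edges:
        preds.setdefault(u, set())
        preds.setdefault(v, set()).add(u)

    @lru_cache(maxsize=None)
    def depth(p):
        # A process without predecessors has depth 0.
        return 0 if not preds[p] else 1 + max(depth(q) for q in preds[p])

    return {p: depth(p) for p in preds}


# The state described for Figure 2: p precedes q and r, and q precedes r.
figure2 = [("p", "q"), ("p", "r"), ("q", "r")]
print(depths(figure2))
```

For the Figure 2 state this yields depths 0, 1, and 2 for p, q, and r, and no two neighbors share a depth.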

If  H  is  a  cycle,  the  topology  of  H  does  not  allow  us  to  distinguish  one  process 
from  another.  We  propose  an  algorithm  that  ensures  that  H  is  acyclic  in  every 
state  of  the  system. 

The acyclicity of H in every state of the system guarantees distinguishability
but does not guarantee fairness. We wish to ensure that every process with
conflicts has all its conflicts resolved in its favor in finite time; this
requirement can be ensured by a guarantee that every process with conflicts
rises to the top (i.e., to zero depth) in H in finite time. By the phrase "a
process p will rise to the top in H," we mean that the state of the system will
change, and hence H will change too, so that p will have no predecessor in the
precedence graph H at some later state. If p is at depth 0, then any conflict
that p has will be resolved in p's favor in finite time because p takes
precedence over all of its neighbors.

How should H change? The only way to change H is by changing the directions of
the edges. We propose to implement H, and changes to H, by a distributed
scheme, where each change in H is made locally at one process. Therefore our
requirements are (1) H remains acyclic at all times, (2) H changes in such a
manner that every conflicting process eventually rises to the top in H, and (3)
each change to H be done locally at a process.

To ensure acyclicity, we employ the following rule for changing H:

Acyclicity Rule. All edges incident on a process p may be simultaneously (i.e.,
in  one  atomic  action)  redirected  toward  p. 

This  transformation  preserves  acyclicity  of  H  because  no  cycle  containing  p 
can  be  created  by  the  transformation  since  there  is  no  edge  directed  away  from 
p  after  the  transformation. 

To ensure that every process in a conflict will rise to the top in H eventually
we  employ  the  following  rule: 

Fairness  Rule.  Within  finite  time  after  a  conflict  is  resolved  in  favor  of  a 
process  p  at  depth  0,  p  must  yield  precedence  to  all  its  neighbors. 

This ensures that in the event that a process at depth 0 is in conflict it will
redirect  all  incident  edges  toward  itself  in  finite  time.  This  redirection  of  edges 
follows  the  acyclicity  rule. 

Example. Consider the precedence graph H shown in Figure 3a, where p, q, and r
have depth 0, 1, and 2, respectively. If there are conflicts, then in finite
time the directions of all edges incident on p will be reversed to give the
precedence graph shown in Figure 3b, in which p, q, and r have depth 2, 0, and
1, respectively. If conflicts persist, in finite time the directions of all
edges incident on q will be reversed to give the precedence graph in Figure 3c,
in which p, q, and r have depth 1, 2, and 0, respectively.
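
The example can be replayed mechanically. The sketch below assumes H is stored as a set of directed pairs (u, v), read as "u has precedence over v"; the helper names are illustrative and not part of the paper.

```python
def redirect_toward(edges, p):
    """Acyclicity rule: point every edge incident on p toward p."""
    redirected = set()
    for u, v in edges:
        if p in (u, v):
            other = v if u == p else u
            redirected.add((other, p))
        else:
            redirected.add((u, v))
    return redirected


def depth(edges, p):
    """Longest directed path to p from a process without predecessors."""
    preds = [u for u, v in edges if v == p]
    return 0 if not preds else 1 + max(depth(edges, u) for u in preds)


h3a = {("p", "q"), ("p", "r"), ("q", "r")}
h3b = redirect_toward(h3a, "p")   # p yields to all of its neighbors
h3c = redirect_toward(h3b, "q")   # then q yields
for name, h in (("3a", h3a), ("3b", h3b), ("3c", h3c)):
    print(name, [depth(h, x) for x in "pqr"])
```

Each application leaves the redirected process with no outgoing edge, so it cannot lie on a cycle, and the process that has just been favored drops to the bottom of the ordering.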

The key issue is to devise a distributed implementation of H, as well as the
acyclicity and fairness rules. The distributed aspect of the problem makes it
nontrivial. The difficulty is that a process has to decide whether to yield or
not to yield in a conflict, and the decision has to be made solely on the basis
of the process's local state. It may not be possible to determine the direction
of edges incident on a process only on the basis of the process's local state.
Therefore we devise a distributed implementation of H and a scheme by which
processes resolve conflicts by making local decisions based on partial
information of H.

Our  goal  in  this  section  was  to  discuss  the  concepts  underlying  distributed 
conflict  resolution  and  the  treatment  has  been  informal.  The  following  sections 
offer a more formal treatment of conflict resolution by defining and solving a
specific problem: the drinking philosophers problem, which serves as a paradigm
of conflict resolution problems.

2.  THE DRINKING PHILOSOPHERS PROBLEM (DRINKERS PROBLEM)

The following problem is a generalization of the dining philosophers problem
[2, 3], which has achieved the status of legend, since it captures the essence
of many synchronization problems. Processes, called philosophers, are placed at
the vertices of a finite undirected graph G with one philosopher at each
vertex. A philosopher is in one of three states: (1) tranquil, (2) thirsty, or
(3) drinking. Associated with each edge in G is a bottle.¹ A philosopher can
drink only from bottles associated with his incident edges. A tranquil
philosopher may become thirsty. A thirsty philosopher needs a nonempty set of
bottles that he wishes to drink from. He may need different sets of bottles for
different drinking sessions. On holding all needed bottles, a thirsty
philosopher starts drinking; a thirsty philosopher remains thirsty until he
gets all bottles he needs to drink. On entering the drinking state a
philosopher remains in that state for a finite period, after which he becomes
tranquil. A philosopher may be in the tranquil state for an arbitrary period of
time.

Two  philosophers  are  neighbors  if  and  only  if  there  is  an  edge  between  them 
in  G.  Neighbors  may  send  messages  to  one  another.  Messages  are  delivered  in 
arbitrary  but  finite  time.  Resources,  such  as  bottles,  are  also  encoded  and 
transmitted  as  messages. 

The  problem  is  to  devise  a  nonprobabilistic  solution  that  satisfies  the  following 
constraints. 

Fairness.  No  philosopher  remains  thirsty  forever. 

Symmetry.  All  philosophers  obey  precisely  the  same  rules  for  acquiring  and 
releasing  bottles.  There  is  no  priority  or  any  other  form  of  externally  specified 
static  partial  ordering  among  philosophers  or  bottles. 

Economy.  A  philosopher  sends  and  receives  a  finite  number  of  messages 
between  state  transitions.  In  particular,  permanently  tranquil  philosophers  do 
not  send  or  receive  an  infinite  number  of  messages. 

Concurrency.  The  solution  does  not  deny  the  possibility  of  simultaneous  drink¬ 
ing  from  different  bottles  by  different  philosophers. 

Boundedness.  The  number  of  messages  in  transit,  at  any  time,  between  any 
pair of philosophers is bounded. Furthermore, the size of each message is bounded.

The drinkers problem is a general paradigm for modeling conflicts between
processes. Neighboring philosophers will be prevented from drinking
simultaneously if they wish to drink from the same bottle; this situation
models conflicts for exclusive access to a common file. Neighboring
philosophers may drink simultaneously from different bottles; this situation
models processes writing into different files.

We must prevent the system from entering states in which neighboring
philosophers are indistinguishable. For example, consider philosophers arranged
in a ring and a state in which each philosopher is drinking from his “left”
bottle; philosophers cannot be distinguished in this state. If all philosophers
are drinking from their left bottles and then require both bottles for their
next drinking session, then the philosophers must remain thirsty forever
because a deterministic algorithm cannot choose between indistinguishable
philosophers. However, a system state is certainly possible in which all
philosophers hold their left bottles. If we were to disallow this state, we
would be disallowing a feasible state in which progress is being made, merely
to solve our problem; disallowing feasible states violates our constraint of
concurrency. We appear to be in a quandary because the constraints of symmetric
processes, nonprobabilistic solutions, and concurrency are incompatible for
this problem. The solution is to implement precedence graph H by using special
“auxiliary” resources. The solution to the dining philosophers problem shows us
how to implement H. Therefore we study the dining philosophers problem next. We
then study the drinkers problem using the diners problem solution to
implementing H.

¹ The solution given in this paper also applies to multiple bottles on every
edge. The assumption of one bottle per edge is made for brevity in exposition.

3.  THE DINING PHILOSOPHERS PROBLEM (DINERS PROBLEM)

The  diners  problem  [2]  is  a  special  case  of  the  drinkers  problem  in  which  every 
thirsty  philosopher  needs  bottles  associated  with  all  its  incident  edges  for  all 
drinking  sessions.  We  present  a  solution  for  this  problem  with  the  properties  of 
fairness,  symmetry,  economy,  concurrency,  and  boundedness.  To  distinguish 
between  these  two  problems,  we  use  the  following  terms  for  the  diners  problem, 
with the corresponding term for the drinkers problem in parentheses: thinking
(tranquil), hungry (thirsty), eating (drinking), fork (bottle). The diners problem
disallows  neighbors  from  eating  simultaneously.  The  drinkers  problem  allows 
neighbors  to  drink  simultaneously  provided  that  they  are  drinking  from  different 
bottles. 

We  first  present  an  informal  outline  of  the  solution;  the  next  section  has  a 
detailed  formal  description.  A  fork  is  either  clean  or  dirty.  A  fork  being  used  to 
eat with is dirty and remains dirty until it is cleaned. A clean fork remains clean
until it is used for eating. A philosopher cleans a fork when mailing it (he is
hygienic). A fork is cleaned only when it is mailed. An eating philosopher does
not  satisfy  requests  for  forks  until  he  has  finished  eating.  The  key  issue  is:  which 
requests  should  a  noneating  philosopher  defer?  In  our  algorithm,  a  noneating 
philosopher  defers  requests  for  forks  that  are  clean  and  satisfies  requests  for 
forks  that  are  dirty. 

This solution to the diners problem suggests an implementation of precedence
graph H. The direction of an edge between two neighbors u and v is from u to v
(i.e., u has precedence over v) if and only if (1) u holds the fork shared by u
and v, and the fork is clean, or (2) v holds the fork, and the fork is dirty,
or (3) the fork is in transit from v to u. Observe that the direction (from u
to v) of the edge can change only when u starts eating. Furthermore, all edges
incident on an eating philosopher are directed toward it. Hence we have an
implementation of the acyclicity rule: The direction of edges incident on a
process may be changed only in the following way: all edges incident on a
process may be simultaneously directed toward it.
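
The three conditions that orient an edge can be written as a small predicate. This is a sketch under assumed conventions: the parameter names are hypothetical, `holder` is the philosopher currently holding the fork (or None while it is in transit), and `in_transit_to` names the recipient of a fork in transit.

```python
def has_precedence(u, v, holder, clean, in_transit_to=None):
    """True if the edge between neighbors u and v is directed from u to v.

    v is not consulted directly: with exactly two endpoints, any holder
    other than u is taken to be v.
    """
    if in_transit_to is not None:
        return in_transit_to == u   # (3) fork in transit from v to u
    if holder == u:
        return clean                # (1) u holds the fork and it is clean
    return not clean                # (2) v holds the fork and it is dirty


# A dirty fork gives precedence to the neighbor who does not hold it.
print(has_precedence("u", "v", holder="u", clean=False))
print(has_precedence("v", "u", holder="u", clean=False))
```

For any fork state exactly one of the two endpoints has precedence, so every edge of H has a well-defined direction.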

Initially  all  forks  are  dirty  and  are  located  at  philosophers  in  such  a  way  that 
H  is  acyclic.  Hence  the  following  is  an  invariant:  H  is  acyclic. 

Immediately upon completion of an eating session, a philosopher yields
precedence to his neighbors. A hungry philosopher at depth 0 in H will commence
eating in finite time (because he has precedence over all his neighbors). By
induction on depth, a hungry philosopher at depth k, k > 0, will eat in finite
time because he has precedence over all philosophers at greater depth, and all
philosophers at smaller depth will yield precedence to it in finite time.

A formal treatment of these arguments is found in the next section.

4.  A  HYGIENIC  SOLUTION  TO  THE  DINERS  PROBLEM 

4.1  Algorithm

We  now  give  a  precise  description  of  the  solution  outlined  in  the  last  section.  To 
simplify  our  description,  we  introduce  a  request  token  for  each  fork.  Only  the 
holder of the request token for fork f can request fork f. A request for a fork is
made by sending the corresponding request token to the holder of the fork. It
follows then that only one process (the holder of the request token for f) may
request fork f and the requested process, after receiving the token, has the next
chance  to  request  the  fork.  Also,  if  a  process  holds  a  fork  and  the  request  token 
for  the  fork  then  his  neighbor  (with  whom  he  shares  the  fork)  has  an  outstanding 
request  for  the  fork.  We  introduce  the  following  Boolean  variables: 

fork_u(f): philosopher u holds fork f,

reqf_u(f): philosopher u holds the request token for fork f,

dirty_u(f): fork f is at u and is dirty,

thinking_u / hungry_u / eating_u: philosopher u is thinking/hungry/eating.

We drop the subscripts from the Boolean variables when the context is clear.
Each philosopher follows the rules given below for requesting and sending
forks. In each case a rule is written as g → A, where g is a condition and A is
a sequence of actions. These rules constitute our solution to the diners
problem. The set of rules forms a single guarded command [4].

(R1) Requesting a fork f:

    hungry, reqf(f), ¬fork(f) →
        send request token for fork f (to the philosopher with whom f is shared);
        reqf(f) := false

(R2) Releasing a fork f:

    ¬eating, reqf(f), dirty(f) →
        send fork f (to the philosopher with whom fork f is shared);
        dirty(f) := false;
        fork(f) := false

(R3) Receiving a request token for f:

    upon receiving a request for fork f →
        reqf(f) := true

(R4) Receiving a fork f:

    upon receiving fork f →
        fork(f) := true
        {¬dirty(f)}

We  note  that  the  statement  of  the  diners  problem  defines  transitions  among 
states  ( thinking ,  hungry,  eating)  for  a  philosopher,  and  we  furthermore  have  for 
any  philosopher, 

    eating, fork(f) ⇒ dirty(f).

Initial Conditions

1. All forks are dirty. [For all f: dirty_u(f) or dirty_v(f), where u, v are
the philosophers who can use fork f.]

2. Every fork f and request token for f are held by different philosophers. [If
fork f is shared between philosophers u and v, then u holds the fork and v the
token (i.e., fork_u(f), reqf_v(f), ¬fork_v(f), ¬reqf_u(f)), or v holds the fork
and u the token.]

3. H is acyclic. (The forks are initially placed in such a manner that H is
acyclic.)
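
Rules R1-R4 and the initial conditions can be exercised with a small round-based simulation. This is only a sketch under stated assumptions, none of which come from the paper: messages sent in one round arrive in order at the start of the next, every philosopher starts hungry and eats once for one round, forks start at the lower-numbered endpoint, and all names are illustrative.

```python
from collections import deque


def simulate_diners(n, edges, max_steps=1000):
    """Apply rules R1-R4 until each philosopher has eaten once.

    Returns the order in which philosophers ate.
    """
    nbrs = {u: [] for u in range(n)}
    fork, reqf, dirty = {}, {}, {}
    for a, b in edges:
        f = (min(a, b), max(a, b))           # a fork is named by its edge
        lo, hi = f
        nbrs[lo].append((hi, f))
        nbrs[hi].append((lo, f))
        # Initial conditions: the fork is dirty at the lower id and the token
        # is at the higher id, so higher ids take precedence and H is acyclic.
        fork[lo, f], reqf[lo, f], dirty[lo, f] = True, False, True
        fork[hi, f], reqf[hi, f], dirty[hi, f] = False, True, False
    state = {u: "hungry" for u in range(n)}
    in_transit = deque()
    meals = []
    for _ in range(max_steps):
        arriving, in_transit = in_transit, deque()
        for dest, kind, f in arriving:
            if kind == "request":
                reqf[dest, f] = True                          # R3
            else:
                fork[dest, f], dirty[dest, f] = True, False   # R4: arrives clean
        for u in range(n):
            if state[u] == "eating":
                state[u] = "thinking"        # eating is finite
            for v, f in nbrs[u]:
                if state[u] != "eating" and reqf[u, f] and dirty[u, f]:     # R2
                    in_transit.append((v, "fork", f))
                    dirty[u, f] = fork[u, f] = False
                if state[u] == "hungry" and reqf[u, f] and not fork[u, f]:  # R1
                    in_transit.append((v, "request", f))
                    reqf[u, f] = False
            if state[u] == "hungry" and all(fork[u, f] for _, f in nbrs[u]):
                state[u] = "eating"
                for _, f in nbrs[u]:
                    dirty[u, f] = True       # eating dirties every fork held
                meals.append(u)
        # Safety check: neighbors never eat at the same time.
        assert not any(state[a] == "eating" == state[b] for a, b in edges)
        if len(meals) == n:
            break
    return meals


print(simulate_diners(3, [(0, 1), (1, 2), (0, 2)]))
```

The round-based schedule is just one admissible execution; the rules themselves only require that messages arrive in finite time.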
4.2  Proof  of  the  Hygienic  Solution  for  the  Diners  Problem 

We  show  in  this  section  that  every  hungry  philosopher  will  eat.  In  addition  to 
this  fairness  condition,  we  show  that  our  solution  has  the  properties  of  symmetry, 
economy,  concurrency,  and  boundedness. 

Fairness 

LEMMA 1. H is always acyclic.

Proof.  Initially  H  is  acyclic.  The  directions  of  edges  in  H  may  be  affected 
only  when  a  fork  changes  its  status  (dirty  or  clean)  or  its  location.  We  will  show 
that  every  change  to  H  preserves  acyclicity.  Every  transmission  of  a  fork  is 
accompanied  by  a  change  in  its  status  from  dirty  to  clean;  this  does  not  change 
the  direction  of  any  edge.  A  fork  is  dirtied  when  the  philosopher  u  holding  it, 
eats. In this case u must be holding all other forks associated with edges incident
upon it, and they must all be dirty. u cannot then create a cycle in H because all
edges  upon  u  are  directed  toward  it.  □ 

THEOREM  2.  Every  hungry  philosopher  eats. 

The  following  proof  is  based  on  the  fact  that  a  hungry  philosopher  requesting 
a  fork  that  is  currently  dirty  will  receive  it  (from  rule  R2),  and  since  the  fork  is 
clean  upon  receipt  the  philosopher  will  hold  it  until  he  eats.  A  philosopher 
requesting  a  fork  that  is  clean  must  make  the  request  to  a  philosopher  at  a 
smaller  depth  and,  by  induction  on  depth,  this  philosopher  will  eat  and  then 
dirty  the  fork,  in  which  case  the  first  argument  applies. 

PROOF. Recall that the depth of a philosopher in H is the maximum number of
edges along a path to that philosopher from one without predecessors. We prove
the theorem by induction on depth of a hungry philosopher; the induction
amounts to showing that hungry philosophers at depth k in every H eat, provided
all hungry philosophers at depths smaller than k in every H eat, for all k ≥ 0.

We will not do a special analysis for hungry philosophers at depth 0, because
that case is subsumed by Case 1, below.

Let u, v be neighbors and u be hungry. We show that u holds or will hold the
fork f corresponding to the edge (u, v) and will thereafter continue to hold it
until u eats. If u holds the fork currently and holds it continuously until he
eats, the result is trivial. Therefore assume that v holds the fork f sometime
before u eats next. We do a case analysis on the status of f at the time that v
holds the fork. At this time we have: hungry_u, ¬fork_u(f), fork_v(f).

Case 1: f is dirty (dirty_v(f) = true). If reqf_u(f) holds then u will request
f (because the precondition of rule R1 will hold) and subsequently reqf_v(f)
will hold; otherwise reqf_v(f) already holds. If eating_v holds then at some
later point (since eating is finite), ¬eating_v holds, and all other predicates
for rule R2 still hold. Therefore rule R2 will be applied by v, and u will
eventually hold a clean fork. u will not release a clean fork until u eats.

Case 2: f is clean (dirty_v(f) = false). Every fork held by a nonhungry
philosopher is dirty because

(a) all forks are dirty initially,

(b) only hungry philosophers receive clean forks, and

(c) all forks held by eating philosophers are dirty.

Since f is clean, the philosopher v holding it must be hungry. Furthermore,
because f is clean, (v, u) is an edge in H and hence depth(v) < depth(u).
According to the induction hypothesis, v eats and hence dirties f. Case 1 then
applies. □

Symmetry. It follows from the description of the algorithm that all
philosophers follow the same rules.

Economy. The number of message sends and receives before a state transition is
bounded as follows: if d is the number of neighbors of a philosopher, then no
more than d requests or forks will be sent or received. More precisely, suppose
a philosopher has e dirty forks when he transits to hungry state. Then he must
send d - e requests and receive a fork corresponding to each request. In
addition, in the worst case, he may lose all e forks he had held initially and
therefore have to request and receive them. Assume that a philosopher
implements the latter situation by sending a fork and the request for it in one
message. Then no more than 2d messages are needed before transiting to the
eating state. The only messages received in the eating or thinking state are
the requests for forks held by the philosopher and hence these do not exceed d.
In the best case, a philosopher with permanently thinking philosophers as
neighbors will receive no requests for forks and therefore may live a life
(think and eat) free of interaction with others.

Concurrency. Our solution does not deny any feasible system state; that is,
any  state  of  the  system  in  which  neighboring  philosophers  are  not  eating  is 
allowable  in  our  solution.  This  is  because  the  solution  does  not  prevent  a 
philosopher  from  entering  the  thinking  or  hungry  state;  the  only  restriction  is  in 
entering  the  eating  state,  and  that  is  allowable  when  a  hungry  philosopher  holds 
all  forks,  as  required  by  the  problem. 

Boundedness. There are at most two messages (a fork and a request for a fork)
in transit between any two philosophers.

5.  A  SOLUTION  TO  THE  DRINKERS  PROBLEM 

5.1  The  Precedence  Graph 

Our  solution  to  the  drinkers  problem  uses  precedence  graphs  discussed  in  Section 
1.  The  solution  to  the  diners  problem  demonstrates  a  distributed  implementation 
of the precedence graph H. Fairness and the acyclicity of H are ensured by
implementation of the fairness and acyclicity rules. It may appear that H provides
a simple resolution mechanism for any type of conflict, including conflict for
bottles in the drinkers problem, since any conflict can be resolved in favor of the
process with greatest precedence. However, there is a difficulty due to the
distributed implementation of H. Given only the state of process u we can
determine which of neighbors u or v has precedence if u holds the fork. If the
fork is clean u has precedence, if it is dirty v has precedence. However, if u does
not hold the fork we cannot determine which of u or v has precedence from the
state of u alone. In this case u must make local decisions about holding on to or
releasing bottles without using precedence graph H. This issue is discussed next
in  the  context  of  the  drinkers  problem. 

We use forks to implement H. Forks are auxiliary resources in the sense that
their sole purpose is to implement precedence graph H. Forks are not part of the
drinkers problem specification; they are part of the solution. The real resources
in the drinkers problem are bottles. Our philosophers can eat and drink
simultaneously, and we emphasize that eating is an artifact of our solution, used only
to guarantee fair drinking. In our solution, the state of a philosopher is a pair
(diner's state, drinker's state), where a diner's state is one of thinking, hungry,
or eating and a drinker's state is one of tranquil, thirsty, or drinking. Our next
step is to define the dining characteristics of our philosophers; the drinking
characteristics are specified by the problem. We give rules that ensure that all
thirsty philosophers drink in finite time.

Consider the state transitions of a dining philosopher. The only transitions
that are decided by the philosopher are thinking-to-hungry and eating-to-thinking;
the only transition completely specified by the diners problem is hungry-to-eating
(which occurs when a philosopher holds all forks he needs). We now give
rules for the dining philosopher to decide the point of the first two transitions.

(D1) Thinking-to-Hungry Transition:

A thinking, thirsty philosopher becomes hungry.

(D2) Eating-to-Thinking Transition:

An eating, nonthirsty philosopher starts thinking.

In the diners problem, a philosopher can think for arbitrary time though he
must eat for finite time. Therefore our obligation, arising out of rules (D1) and
(D2), is to ensure that each eating period is finite. This is accomplished by the
rule (D3) given below.

(D3) The Conflict Resolution Rule:

Philosopher u sends a bottle to philosopher v, in response to a request
from v, if and only if u does not need the bottle or [u is not drinking
and does not hold the fork for the edge (u, v)].

Note that u's decision to send or hold onto a bottle requested by v depends on
whether u holds the fork associated with edge (u, v), and does not depend on
whether u or v has precedence in H. In particular, u must send the bottle to v if
u has precedence over v, but u does not hold the fork associated with edge (u, v).
We must show that despite this fact, the algorithm is fair.

The basic idea is this: Suppose u has precedence over v (i.e., (u, v) is an edge
in H), but v holds the fork (i.e., the fork is dirty), and suppose u requests a bottle
held by v. We require that u not only request the bottle held by v, but that u also
request the fork. We show (from the solution to the diners problem) that in finite
time v will yield the fork to u after which it must also yield the bottle to u. Thus,
the algorithm ensures that if u has precedence over v in H then eventually the
conflict resolution rule causes conflicts for bottles between u and v to be resolved
in u's favor.
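Rule (D3) is stated as a disjunction, while the send guard in Section 5.2 is written as a negated conjunction. The following minimal Python sketch checks by exhaustive enumeration that the two forms agree. The function names and the three Boolean parameters are illustrative assumptions, not part of the original algorithm text.

```python
from itertools import product


def sends_by_d3(need: bool, drinking: bool, holds_fork: bool) -> bool:
    """Rule (D3): u sends the bottle iff u does not need it, or
    u is not drinking and does not hold the fork for edge (u, v)."""
    return (not need) or ((not drinking) and (not holds_fork))


def sends_by_guard(need: bool, drinking: bool, holds_fork: bool) -> bool:
    """Guard form: not [need and (drinking or fork)]."""
    return not (need and (drinking or holds_fork))


def rules_agree() -> bool:
    """Compare both forms on all eight truth assignments."""
    return all(
        sends_by_d3(n, d, f) == sends_by_guard(n, d, f)
        for n, d, f in product([False, True], repeat=3)
    )
```

The agreement is an instance of De Morgan's laws. In particular, a philosopher who needs the bottle and is not drinking keeps it exactly when he holds the fork.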

5.2  Algorithm  for  the  Drinkers  Problem 

Now, we state the algorithm formally. As before, we introduce a request token,
reqb, for every bottle b. The following Boolean variables are used:

bot_u(b): philosopher u holds bottle b

reqb_u(b): philosopher u holds request token for bottle b

need_u(b): philosopher u needs bottle b

tranquil_u/thirsty_u/drinking_u: philosopher u is tranquil/thirsty/drinking

As  before,  we  drop  the  subscript  when  the  context  is  understood.  From  the 
problem  statement  we  have, 

tranquil ⇒ ∀b [¬need(b)]

State transitions for dining philosopher determined by drinking states are

(D1) thinking, thirsty → hungry := true
(D2) eating, ¬thirsty → thinking := true

Other  actions  of  the  dining  philosopher  remain  unchanged. 

Rules for bottle and request transmissions (Let f be the fork corresponding to
bottle b, i.e., fork f and bottle b are shared by the same two processes):

(R1) Request a Bottle:

thirsty, need(b), reqb(b), ¬bot(b) →
send request token for bottle b;
reqb(b) := false

(R2) Send a Bottle:

reqb(b), bot(b), ¬[need(b) and (drinking or fork(f))] →
send bottle b;
bot(b) := false

(R3) Receive Request for a Bottle:

upon receiving request for bottle b →
reqb(b) := true

(R4) Receive a Bottle:

upon receiving bottle b →
bot(b) := true

Initial  Conditions 

For  Dining  Philosophers:  As  before. 

For Drinking Philosophers: A bottle and the request token for it are held by
different philosophers; that is, if u, v share bottle b, then u holds the bottle and
v the token (bot_u(b), reqb_v(b), ¬bot_v(b), ¬reqb_u(b)), or v holds the bottle and u
the token.
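The bottle rules can be exercised on a single edge. The sketch below is a minimal Python rendering of (R1) through (R4) for one philosopher, under several simplifying assumptions: the class name `Drinker`, the method names, and the use of a `deque` as a channel are all hypothetical, and the dining layer (forks, hungry and eating states) is not simulated, so fork ownership is simply supplied as data.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Drinker:
    name: str
    state: str = "tranquil"                    # "tranquil", "thirsty" or "drinking"
    bot: dict = field(default_factory=dict)    # bot[b]: holds bottle b
    reqb: dict = field(default_factory=dict)   # reqb[b]: holds request token for b
    need: dict = field(default_factory=dict)   # need[b]: needs bottle b
    fork: dict = field(default_factory=dict)   # fork[b]: holds the fork on b's edge

    def r1_request(self, b, outbox) -> bool:
        """(R1) Request a bottle; returns True if the rule fired."""
        if self.state == "thirsty" and self.need[b] and self.reqb[b] and not self.bot[b]:
            outbox.append(("request", b))
            self.reqb[b] = False
            return True
        return False

    def r2_send(self, b, outbox) -> bool:
        """(R2) Send a bottle; returns True if the rule fired."""
        keeps = self.need[b] and (self.state == "drinking" or self.fork[b])
        if self.reqb[b] and self.bot[b] and not keeps:
            outbox.append(("bottle", b))
            self.bot[b] = False
            return True
        return False

    def receive(self, msg) -> None:
        kind, b = msg
        if kind == "request":
            self.reqb[b] = True                # (R3)
        elif kind == "bottle":
            self.bot[b] = True                 # (R4)


def run_two_party_demo():
    """u holds bottle b and does not need it; thirsty v holds the token."""
    u = Drinker("u", bot={"b": True}, reqb={"b": False},
                need={"b": False}, fork={"b": True})
    v = Drinker("v", state="thirsty", bot={"b": False}, reqb={"b": True},
                need={"b": True}, fork={"b": False})
    u_to_v, v_to_u = deque(), deque()
    v.r1_request("b", v_to_u)        # v sends the request token
    u.receive(v_to_u.popleft())      # u now holds the token
    u.r2_send("b", u_to_v)           # u does not need b, so it sends b
    v.receive(u_to_v.popleft())      # v now holds the bottle
    return u, v
```

After the exchange the bottle and its request token are again held by different philosophers, with the roles of u and v swapped relative to the initial condition.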
5.3 Proof of Correctness of the Solution to Drinkers Problem

We  show  that  the  solution  has  the  desired  properties  of  fairness,  symmetry, 
economy,  concurrency  and  boundedness. 

Fairness 

LEMMA 3. Every eating period is finite.

PROOF. If an eating philosopher is nonthirsty, he completes eating (D2). If
philosopher u is eating, he is holding all forks. If he holds a bottle that he needs,
he will not release it until he completes drinking, from the precondition of (R2).
If he needs and does not hold a bottle that he shares with v, then he holds or will
hold the request token for the bottle (same proof as in Case 1 of Theorem 2). He
will request the bottle, from (R1), and v will have to send the bottle in finite time
(R2) since v does not hold the fork and v can be in drinking state only for finite
duration. Therefore u will hold all bottles he needs in finite time. Since u drinks
for finite time, u will become tranquil in finite time and, from (D2), u will stop
eating in finite time. □

Since  every  eating  period  is  finite,  Theorem  2  applies  and  we  have 

Corollary  4.  Every  hungry  philosopher  starts  eating  in  finite  time. 

THEOREM  5.  Every  thirsty  philosopher  drinks  in  finite  time. 

PROOF. When a philosopher becomes thirsty he is either thinking, hungry, or
eating. A (thirsty, thinking) philosopher becomes hungry in finite time (from
D1); a hungry philosopher starts eating in finite time (from Corollary 4).
Therefore every philosopher who remains thirsty will eat in finite time. The theorem
follows from Lemma 3 and the fact (D2) that eating can be terminated only by
drinking. □

Symmetry.  Follows  from  the  description  of  the  algorithm. 

Economy. We first show that a bottle b can travel at most twice between
neighbors, u, v, before one of them drinks from b. A bottle is sent in response to
a request from a thirsty philosopher. Let (u, v) be a directed edge in H; the bottle
will travel at most once from u to v and will then be held by v until v drinks.
This is because (1) either v holds a clean fork, which will not be released until
after eating (and hence drinking), and therefore the bottle b, which is needed by
v will not be released, or (2) u holds a dirty fork, which must have been requested
by v (when v became thirsty and hence hungry) and will be mailed, after being
cleaned, along with the bottle to v, and then case (1) applies. Hence a bottle can
travel at most twice between neighbors before one of them drinks.

Lemma 6. There are at most 4qd message transmissions for q drinking sessions
among all philosophers, where d is the maximum degree (i.e., the maximum number
of neighbors) of any philosopher.

PROOF. There is at most one request (for fork and/or bottle), one transmission
of a fork, and two transmissions of a bottle between neighbors before one of them
drinks. Therefore, when a philosopher drinks, there must have been no more
than 4 messages per each of its neighbors and hence the result. □

Concurrency. The argument for concurrency is similar to that for the diners
problem. We note that no feasible state of the drinkers problem is being
eliminated in our solution.

Boundedness. There are at most three messages (a request for a bottle and/or
fork, a bottle, or a fork) in transit from one philosopher to another at any point.

6.  SUMMARY 

We have described a distributed implementation of a precedence graph. The
changes to the graph are such that the graph is always acyclic. The depth of a
process in the graph is the process's distinguishing characteristic. The graph is
implemented by the "forks" of the diners problem. Two processes share a fork if
they may conflict with one another. The conflict-resolution rule is: A process u
yields in a conflict with a process v if and only if u does not hold the fork shared
with v. The algorithm ensures that if processes u and v are in conflict, and u has
precedence over v in the precedence graph, then the conflict-resolution rule will,
eventually, cause conflicts to be resolved in u's favor.

Many types of conflict can be resolved by using the conflict-resolution rule
coupled with our distributed implementation of the precedence graph. For
instance, consider the multiple concurrent mutual exclusion problem described
next. A critical section in a process has an arbitrary number of colors associated
with it (where a color is some attribute of the critical section). The problem is to
devise a scheme by which, for each color c, there is at most one process executing
a critical section with associated color c. For example, a color may correspond to
the privilege of exclusive access to a specific file, and associated with each critical
section is the set of files accessed within that section. If all critical sections have
the same set of colors, the problem reduces to the classical mutual exclusion
problem.

We  use  our  solution  to  the  drinkers  problem  to  solve  the  concurrent  mutual 
exclusion  problem.  We  use  a  variant  of  the  drinkers  problem  in  which  a  pair  of 
philosophers may share an arbitrary number of bottles. The bottles are colored,
each bottle having precisely one color. A pair of philosophers share at most one
bottle of a given color. A bottle is specified by the edge it is on (i.e., by the pair
of philosophers who share it) and by its color. The set of bottles a thirsty
philosopher needs to drink is arbitrary; it may include any bottle he shares. For
instance, when philosopher i becomes thirsty, he may need to hold the red bottle
shared with j and the red bottle shared with k and the blue bottle shared with k.
If  there  is  precisely  one  bottle  on  each  edge  the  problem  reduces  to  the  one 
discussed  earlier.  We  leave  it  to  the  reader  to  show  that  the  algorithm  given 
earlier  also  applies  to  the  extension  to  colored  bottles. 

Given a concurrent mutual exclusion problem, we construct a drinkers problem
as follows. Philosophers (processes) i and j share a bottle with color c if and only
if both philosophers have critical sections with color c. A process i may enter a
critical section with a set of colors C if and only if, for all colors c in C, and for all
edges e incident on process i, the bottle of color c on edge e is held by philosopher
i. In this case it is obvious that no neighboring philosopher can enter a critical
section with a color c in C.
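The construction above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `build_bottles` and `may_enter`, the representation of a bottle as an (edge, color) pair, and the `holder` map are assumptions introduced here, and the message passing that actually moves bottles is omitted.

```python
from itertools import combinations


def build_bottles(colors_of):
    """colors_of maps each process to the set of colors used by its
    critical sections. Processes i and j share a bottle of color c
    iff both have a critical section with color c."""
    bottles = set()
    for i, j in combinations(sorted(colors_of), 2):
        for c in colors_of[i] & colors_of[j]:
            bottles.add((frozenset((i, j)), c))
    return bottles


def may_enter(i, wanted, bottles, holder):
    """Process i may enter a critical section with color set `wanted`
    iff it holds the bottle of each wanted color on each incident edge."""
    return all(
        holder[(edge, c)] == i
        for (edge, c) in bottles
        if i in edge and c in wanted
    )
```

For example, with processes i, j, k where i and k use red and blue and j uses only red, four bottles are created, and a process that holds every red bottle on its edges excludes all of its neighbors from red critical sections.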

7.  PREVIOUS  WORK 

The  distributed  dining  philosophers  problem  (philosophers  at  the  vertices  of  a 
graph)  and  the  dining  philosophers  problem  (five  philosophers  arranged  in  a 
ring) appear in [2, 3]. Dijkstra's solutions to the former problem are based on
instantaneous atomic transmissions of messages to all neighbors or static fork
orderings. Lynch [9] has carried out an extensive analysis of static
resource-ordering algorithms.

The  problem  of  mutual  exclusion  among  a  group  of  processes  in  executing 
their critical sections is a special case of the diners problem: Every process is a
neighbor of every other process and execution of a critical section corresponds to
eating. Distributed solutions to mutual exclusion using timestamps and process
IDs to break ties appear in Lamport [7] and in Ricart and Agrawala [10]. Shared
counter variables have been used in [1] for solving the dining philosophers
problem. 

A symmetric distributed solution to the diners problem appears in Francez and
Rodeh [5]. They use an extended form of CSP [6], in which both input and
output commands are used in guards.

Lehmann and Rabin [8] give a perfectly symmetric probabilistic algorithm and
show that there is no perfectly symmetric nonprobabilistic solution to the diners
problem. 
1. Introduction


This paper presents algorithms by which a process in a distributed system can determine a
global state of the system during a computation. Processes in a distributed system communicate
by sending and receiving messages. A process can record its own state and the messages it sends
and receives; it can record nothing else. To determine a global system state, a process p must
enlist the cooperation of other processes who must record their own local states and send the
recorded local states to p. All processes cannot record their local states at precisely the same
instant unless they have access to a common clock. We assume that processes do not share clocks
or memory. The problem is to devise algorithms by which processes record their own states and
the  states  of  communication  channels  so  that  the  set  of  process  and  channel  states  recorded  form 
a  global  system  state.  The  global  state  detection  algorithm  is  to  be  superimposed  on  the 
underlying  computation;  it  must  run  concurrently  with,  but  not  alter,  this  underlying 
computation. 

The  state-detection  algorithm  plays  the  role  of  a  group  of  photographers  observing  a 
panoramic,  dynamic  scene,  such  as  a  sky  filled  with  migrating  birds  -  a  scene  so  vast  that  it 
cannot  be  captured  by  a  single  photograph.  The  photographers  must  take  several  snapshots  and 
piece  the  snapshots  together  to  form  a  picture  of  the  overall  scene.  The  snapshots  cannot  all  be 
taken  at  precisely  the  same  instant  because  of  synchronization  problems.  Furthermore,  the 
photographers  should  not  disturb  the  process  that  is  being  photographed;  for  instance  they  cannot 
get all the birds in the heavens to remain motionless while the photographs are taken. Yet the
composite picture should be meaningful. The problem before us is to define "meaningful" and
then to determine how the photographs should be taken.

We now describe an important class of problems that can be solved with the global state
detection algorithm. Let y be a predicate function defined on the global states of a distributed
system D; i.e., y(S) is true or false for a global state S of D. The predicate y is said to be a stable
property of D if y(S) implies y(S') for all global states S' of D reachable from global state S of
D. In other words, if y is a stable property and y is true at a point in a computation of D, then y
is true at all later points in that computation. Examples of stable properties are "computation
has terminated," "the system is deadlocked" and "all tokens in a token ring have disappeared."
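For a system small enough to enumerate, the definition of a stable property can be checked directly. The sketch below is a hedged illustration in Python: it assumes a finite state space given by a hypothetical `successors` function, and it only examines states reachable from a chosen initial state. Checking one-step successors is enough, since stability over longer paths then follows by induction.

```python
from collections import deque


def reachable(start, successors):
    """Breadth-first search over the global-state transition graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


def is_stable(y, initial, successors) -> bool:
    """True iff y(S) implies y(S') for every one-step successor S'
    of every state S reachable from `initial`."""
    return all(
        y(t)
        for s in reachable(initial, successors)
        if y(s)
        for t in successors(s)
    )
```

As a toy usage, take a countdown whose states are the integers 3, 2, 1, 0 with a single transition from s to s - 1 while s > 0. The predicate "s == 0" (the computation has terminated) is stable, whereas "s is even" is not, because state 2 is followed by state 1.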

Several distributed system problems can be formulated as the general problem of devising
an  algorithm  by  which  a  process  in  a  distributed  system  can  determine  whether  a  stable  property 
y of the system holds. Deadlock detection and termination detection [6-8] are special cases
of the stable property detection problem. Details of the algorithm are presented later. The basic
idea of the algorithm is that a global state S of the system is determined and y(S) is computed to
see if the stable property y holds.

Several  algorithms  for  solving  deadlock  and  termination  problems  by  determining  the  global 
states of distributed systems have been published. Gligor and Shattuck [1] state that many of the
published  algorithms  are  incorrect  and  impractical.  A  reason  for  the  incorrect  or  impractical 
algorithms  may  be  that  the  relationships  between  local  process  states,  global  system  states  and 
points  in  a  distributed  computation  are  not  well  understood.  One  of  the  contributions  of  this 
paper  is  to  define  these  relationships. 

Many distributed algorithms are structured as a sequence of phases, where each phase
consists  of  a  transient  part  in  which  useful  work  is  done,  followed  by  a  stable  part  in  which  the 
system  cycles  endlessly  and  uselessly.  The  presence  of  stable  behavior  indicates  the  end  of  a 
phase.  A  phase  is  similar  to  a  series  of  iterations  in  a  sequential  program,  which  are  repeated 
until  successive  iterations  produce  no  change,  i.e.  stability  is  attained.  Stability  must  be  detected 
so that one phase can be terminated and the next phase initiated [7]. The termination of a
computational  phase  is  not  identical  to  the  termination  of  a  computation.  When  a  computation 
terminates, all activities cease - messages are not sent and process states do not change. There
may  be  activity  during  the  stable  behavior  which  indicates  the  end  of  a  computational  phase  - 
messages  may  be  sent  and  received,  and  processes  may  change  state,  but  this  activity  serves  no 
purpose  other  than  to  signal  the  end  of  a  phase.  In  this  paper,  we  are  concerned  with  the 
detection  of  stable  system  properties;  the  cessation  of  activity  is  only  one  example  of  a  stable 
property. 

Strictly speaking, properties such as "the system is deadlocked" are not stable if the
deadlock is "broken" and computation is reinitiated. However, to keep exposition simple, we
shall partition the overall problem into the problems of (1) detection of the termination of one
phase (and informing all processes that a phase has ended) and (2) initiating a new phase. The
following is a stable property: "the k-th computational phase has terminated", k = 1, 2, .... Hence,
the methods presented in this paper are applicable to detecting the termination of the k-th phase
for a given k.

In this paper we restrict attention to the problem of detecting stable properties. The
problem of initiating the next phase of computation is not considered here because the solution to
the  problem  varies  significantly  depending  on  the  application,  being  different  for  database 
deadlock  detection  than  for  detecting  the  termination  of  a  diffusing  computation. 

We have to present our algorithms in terms of a model of a system. The model chosen is
not important in itself; we could have couched our discussion in terms of other models. We shall
describe our model informally and only to the level of detail necessary to make the algorithms
clear.

2.  Model  Of  A  Distributed  System 

A  distributed  system  consists  of  a  finite  set  of  processes  and  a  finite  set  of  channels.  It  is 
described  by  a  labeled,  directed  graph  in  which  the  vertices  represent  processes  and  the  edges 
represent  channels.  Figure  2.1  is  an  example. 


Figure 2-1: A Distributed System With Processes p, q, r and Channels c1, c2, c3 and c4

Channels  are  assumed  to  have  infinite  buffers,  to  be  error-free  and  to  deliver  messages  in 
the  order  sent.  (The  infinite  buffer  assumption  is  made  for  ease  of  exposition:  bounded  buffers 
may  be  assumed  provided  there  exists  a  proof  that  no  process  attempts  to  add  a  message  to  a  full 
buffer.)  The  delay  experienced  by  a  message  in  a  channel  is  arbitrary,  but  finite.  The  sequence 
of  messages  received  along  a  channel  is  an  initial  subsequence  of  the  sequence  of  messages  sent 
along  the  channel.  The  state  of  a  channel  is  the  sequence  of  messages  sent  along  the  channel 
excluding  the  messages  received  along  the  channel. 
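The channel model can be made concrete with a small sketch. The Python class below is an assumption-laden illustration (the class name `Channel` and its methods are hypothetical): it keeps the full histories of sent and received messages so that the two stated properties can be observed, namely that the received sequence is an initial subsequence of the sent sequence, and that the channel state is what has been sent but not yet received.

```python
from collections import deque


class Channel:
    """Error-free FIFO channel with an unbounded buffer."""

    def __init__(self):
        self._in_transit = deque()   # messages sent but not yet received
        self.sent = []               # every message ever sent, in order
        self.received = []           # every message ever received, in order

    def send(self, message) -> None:
        self.sent.append(message)
        self._in_transit.append(message)

    def receive(self):
        """Deliver the oldest undelivered message.
        Raises IndexError if the channel is empty."""
        message = self._in_transit.popleft()
        self.received.append(message)
        return message

    def state(self):
        """The channel state: sent messages excluding received ones."""
        return list(self._in_transit)
```

Message delay is not modeled here; a caller decides when `receive` happens, which corresponds to the arbitrary but finite delay of the text.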

A  process  is  defined  by  a  set  of  states,  an  initial  state  (from  this  set),  and  a  set  of  events. 
An  event  e  in  a  process  p  is  an  atomic  action  which  may  change  the  state  of  p  itself  and  the  state 
of  at  most  one  channel  c  incident  on  p:  the  state  of  c  may  be  changed  by  the  sending  of  a 
message  along  c  (if  c  is  directed  away  from  p)  or  the  receipt  of  a  message  along  c  (if  c  is  directed 
towards  p).  An  event  e  is  defined  by  (1)  the  process  p  in  which  the  event  occurs,  (2)  the  state  s  of 
p  immediately  before  the  event,  (3)  the  state  s’  of  p  immediately  after  the  event,  (4)  the  channel  c 
(if any) whose state is altered by the event, and (5) the message M, if any, sent along c (if c is a
channel directed away from p), or received along c (if c is directed towards p). We define e by
the 5-tuple <p, s, s', M, c> where M and c are a special symbol, null, if the occurrence of e does
not change the state of any channel.

A global state of a distributed system is a set of component process and channel states: the
initial global state is one in which the state of each process is its initial state and the state of
each channel is the empty sequence. The occurrence of an event may change the global state.
Let e = <p, s, s', M, c>; we say e can occur in global state S if and only if (1) the state of process
p in global state S is s and (2) if c is a channel directed towards p, then the state of c in global
state S is a sequence of messages with M at its head. We define a function next, where next(S,e)
is the global state immediately after the occurrence of event e in global state S. The value of
next(S,e) is defined only if event e can occur in global state S, in which case next(S,e) is the
global state identical to S except that: (1) the state of p in next(S,e) is s', (2) if c is a channel
directed towards p then the state of c in next(S,e) is c's state in S with message M deleted from
its head, and (3) if c is a channel directed away from p then the state of c in next(S,e) is the same
as c's state in S with message M added to the tail.
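The definitions of "can occur" and next translate almost line for line into code. The following Python sketch is one possible encoding, not the paper's own: a global state is assumed to be a pair (process states, channel states) with channel states stored as tuples, `topology` is a hypothetical map from a channel name to its (source, destination) pair, and `None` stands for the null symbol.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    p: str                # process in which the event occurs
    s: str                # state of p immediately before the event
    s_next: str           # state of p immediately after the event
    M: Optional[str]      # message sent or received, or None (null)
    c: Optional[str]      # channel whose state changes, or None (null)


def can_occur(S, e, topology) -> bool:
    procs, chans = S
    if procs[e.p] != e.s:                              # condition (1)
        return False
    if e.c is not None and topology[e.c][1] == e.p:    # c directed towards p
        return len(chans[e.c]) > 0 and chans[e.c][0] == e.M   # condition (2)
    return True


def next_state(S, e, topology):
    """Return next(S, e) as a new pair; S itself is left unmodified."""
    if not can_occur(S, e, topology):
        raise ValueError("event cannot occur in this global state")
    procs, chans = dict(S[0]), dict(S[1])
    procs[e.p] = e.s_next
    if e.c is not None:
        src, dst = topology[e.c]
        if dst == e.p:
            chans[e.c] = chans[e.c][1:]                # delete M from the head
        elif src == e.p:
            chans[e.c] = chans[e.c] + (e.M,)           # add M to the tail
    return procs, chans
```

Note that a send event needs only condition (1), which reflects the infinite-buffer assumption: a process is never blocked from adding a message to a channel.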

Let seq = (e_i : 0 ≤ i ≤ n) be a sequence of events in component processes of a distributed
system. We say that seq is a computation of the system if and only if event e_i can occur in
global state S_i, 0 ≤ i ≤ n, where S_0 is the initial global state and

S_{i+1} = next(S_i, e_i), for 0 ≤ i ≤ n

An alternate model, based on Lamport [9], which views computations as partially ordered
sets  of  events  is  given  in  [10]. 

Example  2.1 

To illustrate the definition of a distributed system consider a simple system consisting of 2
processes  p  and  q,  and  2  channels  c  and  c’  as  shown  in  figure  2.2. 

The  system  contains  one  token  which  is  passed  from  one  process  to  another,  and  hence  we 
call this system the "single-token conservation" system. Each process has 2 states: s0 and s1,
where s0 is the state in which the process does not possess the token and s1 is the state in which it
does. The initial state of p is s1 and of q is s0. Each process has 2 events: (1) a transition from
s1 to s0 with the sending of the token and (2) a transition from s0 to s1 with the receipt of the
token. The state transition diagram for a process is shown in figure 2.3.

The  global  states  and  transitions  are  shown  in  figure  2.4. 
Figure 2-2: The Simple Distributed System of Example 2.1 and 2.2


Figure  2-3:  State-Transition  Diagram  of  a  Process  in  Example  2.1 

A system computation corresponds to a path in the global state transition diagram (figure
2.4) starting at the initial global state. Examples of system computations are: (1) the empty
sequence and (2) <p sends token, q receives token, q sends token>. The following sequence is
not a computation of the system: <p sends token, q sends token>, because the event "q sends
token" cannot occur while q is in the state s0.

For brevity, the four global states, in order of transition (see figure 2.4), will be called: (1)
in-p,  (2)  in-c,  (3)  in-q  and  (4)  in-c’  to  denote  the  location  of  the  token.  This  example  will  be  used 
later  to  motivate  the  algorithm. 
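The single-token conservation system is small enough to simulate directly. The Python sketch below is a minimal illustration under stated assumptions: events are encoded as hypothetical (process, action) pairs with action "send" or "receive", channel c runs from p to q, and channel c' runs from q to p. The function reports whether a sequence of events is a computation in the sense defined above.

```python
def is_computation(seq) -> bool:
    """Return True iff seq is a computation of the single-token system."""
    state = {"p": "s1", "q": "s0"}       # initially p holds the token
    chan = {"c": [], "c'": []}           # both channels start empty
    out_chan = {"p": "c", "q": "c'"}     # channel directed away from each process
    in_chan = {"p": "c'", "q": "c"}      # channel directed towards each process
    for proc, action in seq:
        if action == "send":
            if state[proc] != "s1":      # only a token holder can send
                return False
            state[proc] = "s0"
            chan[out_chan[proc]].append("token")
        elif action == "receive":
            # the token must be at the head of the incoming channel
            if state[proc] != "s0" or not chan[in_chan[proc]]:
                return False
            chan[in_chan[proc]].pop(0)
            state[proc] = "s1"
        else:
            raise ValueError(f"unknown action: {action!r}")
    return True
```

The empty sequence and <p sends token, q receives token, q sends token> are accepted, while <p sends token, q sends token> is rejected because q is still in state s0 when it tries to send.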

Example  2.2 

This example illustrates non-deterministic computations. Non-determinism plays an
interesting  role  in  the  snapshot  algorithm. 

In example 2.1 there is exactly one event possible in each global state. Consider a system
with the same topology as example 2.1 (see figure 2.2) but where the processes p and q are
defined by the state transition diagrams of figures 2.5 and 2.6.

Figure 2-4: Global States and Transitions of the Single-Token Conservation System

Figure 2-5: State-Transition Diagram for Process p in Example 2.2

Figure 2-6: State-Transition Diagram for Process q in Example 2.2


An example of a computation is shown in figure 2.7. The reader should observe that there
may be more than one transition allowable from a global state: for instance events "p sends M"
and "q sends M'" may occur in the initial global state, and the next states after these events are
different.

3.  The  Algorithm 

3.1.  Motivation  for  the  Steps  of  the  Algorithm 

The global-state recording algorithm works as follows: each process records its own state,
and the 2 processes that a channel is incident on cooperate in recording the channel state. We
cannot ensure that the states of all processes and channels will be recorded at the same instant
because there is no global clock; however, we require that the recorded process and channel states
form a "meaningful" global system state.

The  global-state  recording  algorithm  is  to  be  superimposed  on  the  underlying  computation, 
i.e.  it  must  run  concurrently  with,  but  not  alter,  the  underlying  computation.  The  algorithm 
may  send  messages  and  require  processes  to  carry  out  computations;  however,  the  messages  and 
computation  required  to  record  the  global  state  must  not  interfere  with  the  underlying 
computation. 

We now consider an example to motivate the steps of the algorithm. In the example we
shall  assume  that  we  can  record  the  state  of  a  channel  instantaneously;  we  postpone  discussion  of 
how the channel state is recorded. Let c be a channel from p to q. The purpose of the example is
to gain an intuitive understanding of the relationship between the instant at which the state of
channel c is to be recorded and the instants at which the states of processes p and q are to be
recorded.

Figure 2-7: A Computation for Example

Example  3.1 

Consider the single-token conservation system. Assume that the state of process p is
recorded in global state in-p. Then the state recorded for p shows the token in p. Now assume
that the global state transits to in-c (because p sends the token). Suppose the states of channels c
and c', and process q were recorded in global state in-c, so the state recorded for channel c shows
it with the token and the states recorded for channel c' and process q show them not in possession
of the token. The composite global state recorded in this fashion would show two tokens in the
system, one in p and the other in c. But a global state with two tokens is unreachable from the
initial global state in a single-token conservation system! The inconsistency arises because the
state of p is recorded before p sent a message along c and the state of c is recorded after p sent
the message. Let n be the number of messages sent along c before p's state is recorded, and let n'
be the number of messages sent along c before c's state is recorded. Our example suggests that
the recorded global state may be inconsistent if n < n'.

Now consider an alternate scenario. Suppose the state of c is recorded in global state in-p,
the system then transits to global state in-c, and the states of c', p and q are recorded in global
state in-c. The recorded global state shows no tokens in the system. This example suggests that
the recorded global state may be inconsistent if the state of c is recorded before p sends a message
along c and the state of p is recorded after p sends a message along c, i.e. if n > n'.
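The two scenarios of Example 3.1 can be checked mechanically. The following is a minimal sketch, assuming a hypothetical table GLOBAL_STATES that gives the token count of each component in the global states in-p and in-c; the table and the function name recorded_tokens are illustrative and do not come from the text.

```python
# Hypothetical model of the single-token conservation system.
# Each global state maps a component (process or channel) to its token count.
GLOBAL_STATES = {
    "in-p": {"p": 1, "c": 0, "q": 0, "c'": 0},
    "in-c": {"p": 0, "c": 1, "q": 0, "c'": 0},
}


def recorded_tokens(record_times):
    """Count the tokens shown by a composite record.

    record_times maps each component to the name of the global state
    in which that component's state was recorded.
    """
    return sum(GLOBAL_STATES[when][comp] for comp, when in record_times.items())


# Scenario 1: p recorded in in-p, everything else in in-c (n < n').
two_tokens = recorded_tokens({"p": "in-p", "c": "in-c", "q": "in-c", "c'": "in-c"})

# Scenario 2: c recorded in in-p, everything else in in-c (n > n').
no_tokens = recorded_tokens({"c": "in-p", "p": "in-c", "q": "in-c", "c'": "in-c"})

# Consistent record: every component recorded in the same global state.
one_token = recorded_tokens({comp: "in-c" for comp in ("p", "c", "q", "c'")})
```

Only the third record shows exactly one token; the first two correspond to the inconsistent cases n < n' and n > n'.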

We learn from these examples that (in general) a consistent global state requires

    $n = n'$    (1)

Let m be the number of messages received along c before q's state is recorded. Let m' be
the number of messages received along c before c's state is recorded. We leave it up to the reader
to extend the example to show that consistency requires

    $m = m'$    (2)

In every state, the number of messages received along a channel cannot exceed the number
of messages sent along that channel, i.e.

    $n' \ge m'$    (3)

From the above equations:

    $n \ge m$    (4)

The state of channel c that is recorded must be the sequence of messages sent along the
channel before the sender's state is recorded, excluding the sequence of messages received along the
channel before the receiver's state is recorded, i.e. if $n' = m'$, the recorded state of c must be the
empty sequence, and if $n' > m'$, the recorded state of c must be the $(m'+1)$st, ..., $n'$th messages sent
by p along c. This fact and eqns (1)-(4) suggest a simple algorithm by which q can record the state
of channel c. Process p sends a special message, called a marker, after the n-th message it sends
along c (and before sending further messages along c). The marker has no effect on the
underlying computation. The state of c is the sequence of messages received by q after q records
q's state and before q receives the marker along c. To ensure eqn (4), q must record its state, if it
hasn't done so already, after receiving a marker along c and before q receives further messages
along c.
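The rule for the recorded channel state can be written as a slice of the messages sent along c. The sketch below is an illustration only: the function name recorded_channel_state and the message labels are hypothetical, and n and m stand for the counts defined above (the number of messages sent before the sender records and the number received before the receiver records).

```python
def recorded_channel_state(sent, n, m):
    """Return the state recorded for channel c.

    sent: the messages p has sent along c, in sending order.
    n:    number of messages sent along c before p's state is recorded.
    m:    number of messages received along c before q's state is recorded.
    """
    if m > n:
        # The consistency condition n >= m would be violated.
        raise ValueError("inconsistent record: requires n >= m")
    # Messages m+1 .. n (counting from 1) are in transit in the recorded state.
    return sent[m:n]


messages = ["M1", "M2", "M3", "M4"]
in_transit = recorded_channel_state(messages, n=3, m=1)
empty = recorded_channel_state(messages, n=2, m=2)
```

Here in_transit holds the second and third messages, and empty is the empty sequence because n equals m.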

Our  example  suggests  the  following  outline  for  a  global  state  detection  algorithm. 

3.2.  Global  State  Detection  Algorithm  Outline 

Marker Sending Rule for a Process p

  For each channel c, incident on, and directed away from p:

    p sends one marker along c after p records its state
    and before p sends further messages along c.

Marker Receiving Rule for a Process q

  On receiving a marker along a channel c:

    if q has not recorded its state then
      begin q records its state;
            q records the state of c as the empty sequence
      end
    else q records the state of c as the sequence of messages
         received along c after q's state was recorded and
         before q received the marker along c.
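The two rules can be exercised on a small simulated system. The following is a sketch under stated assumptions, not a definitive implementation: the class names Process and System, the representation of a channel as a queue keyed by (sender, receiver), and the driver at the end are all hypothetical. The driver replays the computation of example 2.2 with p initiating the recording, and the caller (rather than the simulator) performs the state changes of the underlying computation.

```python
from collections import deque


class Process:
    def __init__(self, state):
        self.state = state            # current state of the underlying computation
        self.recorded_state = None
        self.has_recorded = False
        self.channel_states = {}      # completed records of incoming channels
        self.pending = {}             # incoming channels still being recorded


class System:
    """Processes joined by directed FIFO channels keyed by (sender, receiver)."""

    def __init__(self, states, edges):
        self.procs = {name: Process(s) for name, s in states.items()}
        self.chans = {edge: deque() for edge in edges}

    def send(self, src, dst, msg):
        self.chans[(src, dst)].append(("msg", msg))

    def record(self, name):
        """Record the state of process `name`, then apply the marker sending rule."""
        p = self.procs[name]
        if p.has_recorded:
            return
        p.recorded_state, p.has_recorded = p.state, True
        for (src, dst), chan in self.chans.items():
            if dst == name and (src, dst) not in p.channel_states:
                p.pending[(src, dst)] = []      # start recording this input channel
            if src == name:
                chan.append(("marker", None))   # marker precedes any later message

    def deliver(self, src, dst):
        """Deliver the head of channel (src, dst); return the message, or None for a marker."""
        kind, msg = self.chans[(src, dst)].popleft()
        q = self.procs[dst]
        if kind == "marker":                    # marker receiving rule
            if not q.has_recorded:
                q.channel_states[(src, dst)] = []   # channel state is the empty sequence
                self.record(dst)
            else:
                q.channel_states[(src, dst)] = q.pending.pop((src, dst))
            return None
        if (src, dst) in q.pending:             # after q recorded, before the marker
            q.pending[(src, dst)].append(msg)
        return msg


# Replay of example 2.2: p starts in A, q in C; c = (p, q), c' = (q, p).
net = System({"p": "A", "q": "C"}, [("p", "q"), ("q", "p")])
net.record("p")                 # p records A and sends a marker along c
net.send("p", "q", "M")         # p sends M ...
net.procs["p"].state = "B"      # ... and changes state to B
net.send("q", "p", "M'")        # q sends M' ...
net.procs["q"].state = "D"      # ... and changes state to D
net.deliver("q", "p")           # p receives M' ...
net.procs["p"].state = "A"      # ... and changes state to A
net.deliver("p", "q")           # q receives the marker along c
net.deliver("q", "p")           # p receives the marker along c'
```

After the replay, p has recorded state A and the channel from q as the single message M', while q has recorded state D and the channel from p as the empty sequence. The message M is still in transit behind the marker and is not part of the record.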

3.3.  Termination  of  the  Algorithm 

The marker receiving and sending rules guarantee that if a marker is received along every
channel then each process will record its state and the states of all incoming channels. To ensure
that the global-state recording algorithm terminates in finite time, each process must ensure that
(L1) no marker remains forever in an incident input channel and (L2) it records its state within
finite time of initiation of the algorithm.

The algorithm can be initiated by one or more processes, each of which records its state
spontaneously, without receiving markers from other processes; we postpone discussion of what
may cause a process to record its state spontaneously. If process p records its state and there is a
channel from p to a process q, then q will record its state in finite time because p will send a
marker along the channel and q will receive the marker in finite time (L1). Hence if p records its
state and there is a path (in the graph representing the system) from p to a process q, then q will
record its state in finite time because, by induction, every process along the path will record its
state in finite time. Termination in finite time is ensured if for every process q: q spontaneously
records its state or there is a path from a process p, which spontaneously records its state, to q.

In particular, if the graph is strongly connected and at least one process spontaneously
records its state, then all processes will record their states in finite time (provided L1 is ensured).
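The termination condition is a reachability condition on the graph of channels, so it can be checked with a breadth-first search. The sketch below assumes the system is given as a list of (sender, receiver) pairs; the function name all_processes_record is hypothetical, and the check presumes that markers are always eventually delivered.

```python
from collections import deque


def all_processes_record(channels, initiators):
    """Return True if every process is an initiator or reachable from one.

    channels:   iterable of (sender, receiver) pairs, one per directed channel.
    initiators: processes that record their state spontaneously.
    """
    processes = {p for edge in channels for p in edge} | set(initiators)
    outgoing = {p: [] for p in processes}
    for src, dst in channels:
        outgoing[src].append(dst)

    reached = set(initiators)
    frontier = deque(initiators)
    while frontier:                     # breadth-first search along channels
        p = frontier.popleft()
        for q in outgoing[p]:
            if q not in reached:        # q records when its first marker arrives
                reached.add(q)
                frontier.append(q)
    return reached == processes


ring = [("p", "q"), ("q", "r"), ("r", "p")]       # strongly connected
fan_in = [("p", "q"), ("r", "q")]                 # r has no incoming channel
ring_ok = all_processes_record(ring, ["p"])
fan_in_ok = all_processes_record(fan_in, ["p"])
```

In the ring a single initiator suffices. In the second graph r can never receive a marker, so the recording terminates only if r also records spontaneously.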

The algorithm described so far allows each process to record its state and the states of
incoming channels. The recorded process and channel states must be collected and assembled to
form the recorded global state. We shall not describe algorithms for collecting the recorded
information because such algorithms have been described elsewhere [6,7]. A simple algorithm for
collecting information in a system whose topology is strongly-connected is for each process to send
the information it records along all outgoing channels, and for each process receiving information
for the first time to copy it and propagate it along all of its outgoing channels. All the recorded
information will then get to all the processes in finite time, allowing all processes to determine the
recorded global state.
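The simple collection scheme just described amounts to flooding. The following sketch simulates it with a single queue of messages in flight; the function name flood and the example records are hypothetical, and the algorithms of [6,7] may differ.

```python
from collections import deque


def flood(channels, local_records):
    """Flood each process's recorded information along outgoing channels.

    channels:      iterable of (sender, receiver) pairs.
    local_records: dict mapping each process to the information it recorded.
    Returns a dict mapping each process to {origin: record} for all it learned.
    """
    outgoing = {p: [] for p in local_records}
    for src, dst in channels:
        outgoing[src].append(dst)

    known = {p: {p: rec} for p, rec in local_records.items()}
    # Each queue entry (receiver, origin, record) is one message in flight.
    in_flight = deque(
        (q, p, rec) for p, rec in local_records.items() for q in outgoing[p]
    )
    while in_flight:
        receiver, origin, rec = in_flight.popleft()
        if origin in known[receiver]:
            continue                        # seen before: do not propagate again
        known[receiver][origin] = rec       # first time: copy it ...
        for nxt in outgoing[receiver]:      # ... and pass it on
            in_flight.append((nxt, origin, rec))
    return known


ring = [("p", "q"), ("q", "r"), ("r", "p")]
records = {"p": "A", "q": "D", "r": "X"}
collected = flood(ring, records)
```

Because each process forwards a given record at most once, the flooding stops, and in a strongly connected system every process ends up with every record.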


4.  Properties  of  the  Recorded  Global  State 

To gain an intuitive understanding of the properties of the global state recorded by the
algorithm, we shall study example 2.2. Assume that the state of p is recorded in global state $S_0$
(Figure 2-7), so the state recorded for p is A. After recording its state, p sends a marker along
channel c. Now assume that the system goes to global state $S_1$, then $S_2$, and then $S_3$ while the
marker is still in transit, and the marker is received by q when the system is in global state $S_3$.
On receiving the marker, q records its state, which is D, and records the state of c to be the
empty sequence. After recording its state, q sends a marker along channel c'. On receiving the
marker, p records the state of c' as the sequence consisting of the single message M'. The
recorded global state S* is shown in Figure 4-1. The recording algorithm was initiated in global
state $S_0$ and terminated in global state $S_3$.

Figure 4-1: A Recorded Global State for Example 2.2

Observe that the global state S* recorded by the algorithm is not identical to any of the
global states $S_0$, $S_1$, $S_2$, $S_3$ that occurred in the computation. Of what use is the algorithm if the
recorded global state never occurred? We shall now answer this question.

Let seq = $(e_i, 0 \le i)$ be a distributed computation and let $S_i$ be the global state of the system
immediately before event $e_i$, $0 \le i$, in seq. Let the algorithm be initiated in global state $S_\iota$ and let
it terminate in global state $S_\phi$, $0 \le \iota \le \phi$; in other words, the algorithm is initiated after $e_{\iota-1}$ if $\iota > 0$,
and before $e_\iota$, and it terminates after $e_{\phi-1}$ if $\phi > 0$, and before $e_\phi$. We observed in example 2.2 that
the recorded global state S* may be different from all global states $S_k$, $\iota \le k \le \phi$.

We shall show that:

1. S* is reachable from $S_\iota$, and

2. $S_\phi$ is reachable from S*.

Specifically, we shall show that there exists a computation seq' where

1. seq' is a permutation of seq, such that $S_\iota$, S* and $S_\phi$ occur as global states in seq', and

2. $S_\iota$ = S* or $S_\iota$ occurs earlier than S*, and

3. $S_\phi$ = S* or S* occurs earlier than $S_\phi$ in seq'.

Theorem 1: There exists a computation seq' = $(e'_i, 0 \le i)$ where

1. For all i, where $i < \iota$ or $i \ge \phi$: $e'_i = e_i$, and

2. The subsequence $(e'_i, \iota \le i < \phi)$ is a permutation of the subsequence $(e_i, \iota \le i < \phi)$,
and

3. For all i where $i \le \iota$ or $i \ge \phi$: $S'_i = S_i$, and

4. There exists some k, $\iota \le k \le \phi$, such that S* = $S'_k$.
Proof: Event $e_j$ in seq is called a pre-recording event if and only if $e_j$ is on a
process p and p records its state after $e_j$ in seq. Event $e_j$ in seq is called a
post-recording event if and only if it is not a pre-recording event, i.e. if $e_j$ is on a
process p and p records its state before $e_j$ in seq. All events $e_i$, $i < \iota$, are pre-recording
events and all events $e_i$, $i \ge \phi$, are post-recording events in seq. There may be a
post-recording event $e_{j-1}$ before a pre-recording event $e_j$ for some j, $\iota < j < \phi$; this can occur
only if $e_{j-1}$ and $e_j$ are on different processes (because if $e_{j-1}$ and $e_j$ are on the same
process and $e_{j-1}$ is a post-recording event then so is $e_j$).

We  shall  derive  a  computation  seq’  by  permuting  seq,  where  all  pre-recording 
events  occur  before  all  post-recording  events  in  seq’.  We  shall  show  that  S*  is  the 
global  state  in  seq’  after  all  pre-recording  events  and  before  all  post-recording  events. 

Assume that there is a post-recording event $e_{j-1}$ before a pre-recording event $e_j$ in
seq. We shall show that the sequence obtained by interchanging $e_{j-1}$ and $e_j$ must also
be a computation. Events $e_{j-1}$ and $e_j$ must be on different processes. Let p be the
process in which $e_{j-1}$ occurs and let q be the process in which $e_j$ occurs. There cannot
be a message sent at $e_{j-1}$ which is received at $e_j$ because (1) if a message is sent along a
channel c when event $e_{j-1}$ occurs, then a marker must have been sent along c before
$e_{j-1}$, since $e_{j-1}$ is a post-recording event and (2) if the message is received along channel
c when $e_j$ occurs, then the marker must have been received along c before $e_j$ occurs
(since channels are first-in-first-out), in which case (by the marker-receiving rule) $e_j$
would be a post-recording event too.

The state of process q is not altered by the occurrence of event $e_{j-1}$ because $e_{j-1}$ is
on a different process p. If $e_j$ is an event in which q receives a message M along a
channel c then M must have been the message at the head of c before event $e_{j-1}$, since
a message sent at $e_{j-1}$ cannot be received at $e_j$. Hence event $e_j$ can occur in global state
$S_{j-1}$.

The state of process p is not altered by the occurrence of $e_j$. Hence $e_{j-1}$ can occur
after $e_j$. Hence the sequence of events $e_0, \ldots, e_{j-2}, e_j, e_{j-1}$ is a computation. From the
arguments in the last paragraph it follows that the global state after computation
$e_0, \ldots, e_j$ is the same as the global state after computation $e_0, \ldots, e_{j-2}, e_j, e_{j-1}$.

Let seq* be a permutation of seq which is identical to seq except that $e_j$ and $e_{j-1}$
are interchanged. Then seq* must also be a computation. Let $\bar{S}_i$ be the global state
immediately before the i-th event in seq*. From the arguments of the previous
paragraph

    $\bar{S}_i = S_i$ for all i where $i \ne j$

By repeatedly swapping post-recording events which immediately precede
pre-recording events, we see that there exists a permutation seq' of seq in which

1. all pre-recording events precede all post-recording events, and

2. seq' is a computation, and

3. for all i where $i < \iota$ or $i \ge \phi$: $e'_i = e_i$, and

4. for all i where $i \le \iota$ or $i \ge \phi$: $S'_i = S_i$.

Now we shall show that the global state after all pre-recording events and before all
post-recording events in seq' is S*. To do this we need to show that:

1. the state of each process p in S* is the same as its state after the process computation
consisting of the sequence of pre-recorded events on p and

2. the state of each channel c in S* is: (sequence of messages corresponding to
pre-recorded sends on c) - (sequence of messages corresponding to pre-recorded receives on c).

The  proof  of  the  first  part  is  trivial.  Now  we  prove  part  (2).  Let  c  be  a  channel  from 
process  p  to  process  q.  The  state  of  channel  c  recorded  in  S*  is  the  sequence  of  messages  received 
on  c  by  q  after  q  records  its  state  and  before  q  receives  a  marker  on  c.  The  sequence  of  messages 
sent  by  p  along  c  before  p  sends  a  marker  along  c  is  the  sequence  corresponding  to  pre-recorded 
sends  on  c.  Part  (2)  now  follows. 

Example 4.1: The purpose of this example is to show how the computation seq' is derived
from the computation seq. Consider example 2.2. The sequence of events shown in the
computation of Figure 2-7 is:

$e_0$: p sends M and changes state to B
       (a post-recording event)

$e_1$: q sends M' and changes state to D
       (a pre-recording event)

$e_2$: p receives M' and changes state to A
       (a post-recording event)

Since $e_0$, a post-recording event, immediately precedes $e_1$, a pre-recording event, we
interchange them to get the permuted sequence seq':

$e'_0$: q sends M' and changes state to D
        (a pre-recording event)

$e'_1$: p sends M and changes state to B
        (a post-recording event)

$e'_2$: p receives M' and changes state to A
        (a post-recording event)

In seq', all pre-recording events precede all post-recording events. We leave it up to the reader to
show that the global state after $e'_0$ is the recorded global state.
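The exercise left to the reader can be checked with a short program. The sketch below is illustrative only: the event encoding, the function names separate, apply_event and run, and the "pre" flags (copied from the labels given in Example 4.1) are assumptions. It performs the adjacent swaps without re-checking that each swap is legal, since the proof of Theorem 1 establishes that.

```python
def separate(events):
    """Swap adjacent (post, pre) pairs until all pre-recording events come first."""
    seq = list(events)
    swapped = True
    while swapped:
        swapped = False
        for j in range(1, len(seq)):
            if not seq[j - 1]["pre"] and seq[j]["pre"]:
                seq[j - 1], seq[j] = seq[j], seq[j - 1]
                swapped = True
    return seq


def apply_event(state, event):
    """Apply one send or receive event to a global state; return the new state."""
    new = {k: (list(v) if isinstance(v, list) else v) for k, v in state.items()}
    if event["kind"] == "send":
        new[event["chan"]].append(event["msg"])
    else:
        # FIFO channel: the received message must be at the head.
        assert new[event["chan"]][0] == event["msg"]
        new[event["chan"]].pop(0)
    new[event["proc"]] = event["next"]
    return new


def run(state, events):
    for event in events:
        state = apply_event(state, event)
    return state


# Global state S_0 of example 2.2: p in A, q in C, both channels empty.
S0 = {"p": "A", "q": "C", "c": [], "c'": []}
seq = [
    {"proc": "p", "kind": "send", "chan": "c", "msg": "M", "next": "B", "pre": False},
    {"proc": "q", "kind": "send", "chan": "c'", "msg": "M'", "next": "D", "pre": True},
    {"proc": "p", "kind": "recv", "chan": "c'", "msg": "M'", "next": "A", "pre": False},
]
seq_prime = separate(seq)
after_first = apply_event(S0, seq_prime[0])   # global state after the first event of seq'
```

The state after_first has p in A, q in D, c empty and c' holding M', which is the recorded global state S*. Running all of seq and all of seq' from the same initial state also ends in the same global state.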

5.  Stability  Detection 

We now solve the stability detection problem described in section 1. We study the stability
detection problem because it is a paradigm for many practical problems such as distributed
deadlock detection.


A stability-detection algorithm is defined as follows:

Input: A stable property y

Output: A Boolean value definite with the property:

    ($y(S_\iota) \rightarrow$ definite) and
    (definite $\rightarrow y(S_\phi)$)

where $S_\iota$ and $S_\phi$ are the global states of the system when the algorithm is
initiated and when it terminates, respectively. (The symbol $\rightarrow$ denotes logical
implication.)

The input to the algorithm is (the definition of) function y. During the execution of the
algorithm the value y(S) for some global state S may be determined by a process in the system by
applying the externally defined function y to global state S. By the output of the algorithm being
a Boolean value definite we mean that (1) some specially designated process (say p) enters and
thereafter remains in some special state to symbolize an output of definite = true, and (2) p
enters and remains in some other special state to symbolize an output of definite = false.

Definite = true implies that the stable property holds when the algorithm terminates.
However, definite = false implies that the stable property does not hold when the algorithm is
initiated. We emphasize that definite = true gives us information about the state of the system
at the termination of the algorithm whereas definite = false gives us information about the
system state at the initiation of the algorithm. In particular, we cannot deduce from
definite = false that the stable property does not hold at termination of the algorithm.


Solution to the stability detection problem:

begin
    record a global state S*;
    definite := y(S*)
end

The correctness of the stability detection algorithm follows from the following facts:

    S* is reachable from $S_\iota$ and
    $S_\phi$ is reachable from S* (Theorem 1) and
    $y(S) \rightarrow y(S')$ for all S' reachable from S
    (definition of a stable property)
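The solution is a single evaluation of y on a recorded global state. The sketch below illustrates this with a hypothetical stable property, terminated, which holds when all processes are idle and all channels are empty; it is stable only under the assumption that an idle process becomes active solely by receiving a message. The function names, the state layout, and the stand-in for the recording step are all assumptions made for illustration.

```python
def terminated(state):
    """Hypothetical stable property: all processes idle and all channels empty."""
    return (
        all(s == "idle" for s in state["processes"].values())
        and all(len(msgs) == 0 for msgs in state["channels"].values())
    )


def detect_stability(y, record_global_state):
    """Record a global state S* and return definite := y(S*)."""
    s_star = record_global_state()
    definite = y(s_star)
    return definite


# Stand-ins for the recording algorithm: each returns a fixed recorded state.
quiet = {"processes": {"p": "idle", "q": "idle"}, "channels": {"c": [], "c'": []}}
busy = {"processes": {"p": "idle", "q": "idle"}, "channels": {"c": ["M"], "c'": []}}

definite_quiet = detect_stability(terminated, lambda: quiet)
definite_busy = detect_stability(terminated, lambda: busy)
```

A result of true says the property holds when the algorithm terminates. A result of false says only that the property did not hold when the algorithm was initiated; it may have become true since.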